
The following is a technique I’ve used over the last decade or so for distributing web traffic (or potentially any service) across multiple servers, using just DNS. Being an old DNS hack, I’ve called this technique Poor Man’s Anycast, although it doesn’t really use anycasting.

But before we get into the technique, we need to make a brief diversion into a little-known but rather neat feature of the DNS, or more accurately, of DNS forwarders, which makes this a cool way to do stuff. The feature is name server selection.

Most DNS clients, and by this I include your home PC, make use of a DNS forwarder.  The forwarder is the thing that handles (and caches) DNS requests from end clients, while a DNS server carries authoritative information about a limited set of domains and only answers queries for them. These two functions have historically been conflated rather severely, mainly due to the use of BIND for both, and why this is a bad thing is the subject for a whole other post.

Moving right along. A DNS forwarder gets to handle lots of queries for any domain that its clients ask for. When you ask for foo.example.net, it asks one of the root servers (a.root-servers.net, b.root-servers.net et al) for that full domain name (let’s assume it’s just come up and doesn’t have anything cached). It gets back a delegation from the root servers, saying basically, “I don’t know, but the GTLD (.com, .net) servers will”, and telling you where to find the GTLD servers (a.gtld-servers.net et al). You ask one of the GTLD servers, and get back an answer that says they don’t know either, but ns1.example.net and ns2.example.net do.

You then ask (say) ns1.example.net, and hopefully you’ll get the answer you want (e.g. the IP address).
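For the curious, here’s what that walk looks like as a minimal sketch using the dnspython library. It follows referrals by hand, the way a forwarder does; a real forwarder would use the glue addresses supplied with the referral (rather than a separate resolver lookup), would cache everything along the way, and would handle errors far more carefully:

import dns.message
import dns.query
import dns.rcode
import dns.rdatatype
import dns.resolver

qname = "foo.example.net."
server = "198.41.0.4"                          # a.root-servers.net

while True:
    query = dns.message.make_query(qname, dns.rdatatype.A)
    reply = dns.query.udp(query, server, timeout=5)
    if reply.answer:                           # got the address we were after
        print(reply.answer[0])
        break
    auth = reply.authority
    if reply.rcode() != dns.rcode.NOERROR or not auth or auth[0].rdtype != dns.rdatatype.NS:
        break                                  # NXDOMAIN or something unexpected
    # The authority section holds the delegation; pick the first listed
    # name server, find its address, and ask it the same question.
    ns_name = auth[0][0].target.to_text()
    server = dns.resolver.resolve(ns_name, "A")[0].address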

Now, along the way, the forwarder has been caching everything it got. Every time it asks a name server for data, it also records how long that server took to reply. That means that, while looking up names in example.net, the forwarder has been collecting timing and reliability data (as well as the answers themselves), and it uses that data to choose which name server to ask next time. So if ns1.example.net answers in 20 ms, but ns2.example.net answers in 10 ms, roughly two thirds of the queries for something.example.net will be sent to ns2.example.net. If the timing difference is much greater, the split of queries will be even more marked. Similarly, if a name server fails to respond at all, that fact will be reflected in the accumulated preference assigned to that server, and it will get very few queries in future; just enough so that we know we can start sending it queries again when it comes back.
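To make that behaviour concrete, here’s a toy Python simulation of inverse-RTT selection. The weighting scheme is my own illustration, not any particular forwarder’s exact algorithm (real implementations use smoothed RTT estimates and decay old penalties over time), but it reproduces the rough two-thirds/one-third split described above:

import random

# Hypothetical measured round-trip times, in milliseconds.
rtt = {"ns1.example.net": 20.0, "ns2.example.net": 10.0}

def pick_server(rtt):
    # Weight each server by the inverse of its measured RTT, so a
    # server that answers twice as fast gets roughly twice the queries.
    names = list(rtt)
    weights = [1.0 / rtt[name] for name in names]
    return random.choices(names, weights=weights)[0]

counts = {name: 0 for name in rtt}
for _ in range(10_000):
    counts[pick_server(rtt)] += 1

print(counts)   # roughly two thirds of the queries go to ns2.example.net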

This is a powerful effect, and is of particular use when distributing servers over a wide geographical area. DNS specialists know about it, because poor DNS performance affects everything, and DNS people don’t like adversely affecting everything. (They’re really quite paranoid about it. Trust me, I’m one.) But it can also be used to pick the closest server for other things as well.

After all, closeness (in terms of round-trip time) is very important in network performance (see my post on bandwidth delay products).

The technique is as follows. Let’s say we have three web servers, carrying static content. Call them, say, auckland.example.net, chicago.example.net and london.example.net. Let’s say that they’re geographically far apart. All three servers carry content for http://www.example.com/.

So, we start by configuring, on the example.com name servers:

$ORIGIN example.com.
$TTL 86400
www     IN      NS      auckland.example.net.
        IN      NS      chicago.example.net.
        IN      NS      london.example.net.

We then run a DNS server on all three web servers.  We configure the servers with a zone for www.example.com along the lines of:

$ORIGIN www.example.com.
$TTL 86400                       ; Long (24 hour) TTL on NS records etc
@       IN      SOA     auckland.example.net. webmaster.example.com. (
                                2009112900 3600 900 3600000 300 )
        IN      NS      auckland.example.net.
        IN      NS      chicago.example.net.
        IN      NS      london.example.net.
$TTL 300                         ; Short (five minute) TTL on A record
@       IN      A       10.0.0.1 ; Set this to host IP address

Now the key is that each web server serves up its own IP address. When a DNS forwarder makes a query for www.example.com, it will be directed to one of auckland.example.net, chicago.example.net or london.example.net. But as more and more queries get made, one of those three will start handling the bulk of the queries, at least if that one is significantly closer than the other two. And if auckland.example.net gets the query, it answers with its own IP address, meaning that it also gets the subsequent HTTP request or other services directed to it. The short DNS TTL (5 minutes in the example) means that the address gets queried moderately often, allowing the name server selection to “get up to speed”. Much longer TTLs on the name servers mean the data doesn’t get forgotten too quickly.
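Since the only thing that differs between the three copies of the zone is the final A record, one way to keep them in step is to generate each host’s copy from a common template. Here’s a rough sketch in Python; the host addresses and output file names are made up for illustration:

# Generate a per-host copy of the www.example.com zone, each one
# answering with that host's own (hypothetical) address.
hosts = {
    "auckland.example.net": "10.0.0.1",
    "chicago.example.net":  "10.0.0.2",
    "london.example.net":   "10.0.0.3",
}

template = """$ORIGIN www.example.com.
$TTL 86400                       ; Long (24 hour) TTL on NS records etc
@       IN      SOA     auckland.example.net. webmaster.example.com. (
                                2009112900 3600 900 3600000 300 )
        IN      NS      auckland.example.net.
        IN      NS      chicago.example.net.
        IN      NS      london.example.net.
$TTL 300                         ; Short (five minute) TTL on A record
@       IN      A       {address} ; This host's own address
"""

for host, address in hosts.items():
    filename = "www.example.com." + host + ".zone"
    with open(filename, "w") as f:
        f.write(template.format(address=address))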

The result is that in many cases, the best server gets the request.

The technique works best if there are lots of domains being handled by the same set of servers, and there are lots of requests coming through. That way the preferences get set quickly in the major ISPs’ DNS forwarders. The downside of the technique is that far-away servers will still get some queries. This non-determinism may be a reason for not deploying this technique. If you want determinism, you’ll need to look at more industrial-grade techniques.

Now, this isn’t what players like Akamai do, and it isn’t what anycasting is about. Akamai and (some) other content distribution networks work by maintaining a map of the Internet, and returning DNS answers based on the requester’s IP address. But this is a fairly heavyweight answer to the problem. It’s not something you can implement with BIND alone.

Anycasting on the other hand relies on advertising the same IP address in multiple places, and letting BGP sort out the nearest path. This has three disadvantages:

  1. It potentially breaks TCP. If there are equal-cost paths to two different anycast nodes, it’s possible that one packet from a stream might go one way, while the next packet is sent to a completely different host (at the same IP address). In practice, this has proven to be less of a problem than might be expected, but there is still scope for surprises.
  2. Each of your nodes has to be separately BGP peered with its upstream network(s). That’s a lot more administration than many ISPs will do for free.
  3. Most importantly, being close in BGP terms is not the same as being close physically or in terms of round-trip time. Many providers have huge reach within a single AS, so a short AS-path (the main metric for BGP) may actually be a geographically long distance, with a correspondingly long round-trip.

The other nice thing about poor man’s anycast is that it’s dynamic; if a node falls off the world, as long as its DNS goes away too, it’ll just disappear from the cloud as soon as the TTLs time out. If a path to it gets congested, name server selection will notice the increased round-trip time and de-prefer that server.

And of course you don’t need to be a DNS or BGP guru, or buy/build expensive, complex software systems to set it up.

I found myself explaining this one at Curry tonight, in the context of discussing fast broadband.

Basically, if you have a reliable stream protocol like, to take a random example, TCP, and you’re not doing anything imaginative with it, you run into the following problem:

Every byte you send might need to be resent if it gets lost along the way. So you buffer everything you send until you get an acknowledgement from the other end. Let’s say, for argument’s sake, you use a 64k buffer. We call this buffer the window, and its size is the window size.

Now, let’s say you have a looooonnnngggg path between you and your remote endpoint. Let’s say it’s 200 milliseconds, or 1/5th of a second. This is pretty reasonable for an NZ-US connection — the speed of light is not our friend.

And finally, for simplicity’s sake, let’s say that the actual bandwidth over that path is Very High, so serialisation delays (the time taken to put one bit after the next) are negligible.

So,  if I send 64k bytes (or 512 k bits) worth of data, it takes 200 ms before I get an acknowledgement. It doesn’t matter how fast I send my 64k; I still have to stop when I’ve sent it.  200 ms later, I get a bunch of acknowledgements back, for the whole 64k (assuming nothing got dropped), and I can now send my next 64k.

So the actual throughput, through my SuperDuperBroadband connection, is 64k bytes per 200 ms, or 2.5 Mbps.

To turn this around, if I want 2.5 Mbps at 200 ms, I need a 64k byte window; if I want 5 Mbps on a 200 ms path, I’m going to need to up the window size to 5 Mbps times 200 ms = 128 k bytes.

That window size calculation is the bandwidth delay product.
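For the arithmetically inclined, here is that calculation both ways as a few lines of Python (treating 1 k as 1024 bytes, so the numbers come out a shade different from the rounded figures above):

rtt = 0.200                    # seconds: a 200 ms NZ-US round trip
window = 64 * 1024 * 8         # a 64 kbyte window, expressed in bits

throughput = window / rtt      # bits per second
print(round(throughput / 1e6, 1))   # about 2.6 Mbps, i.e. the "2.5 Mbps" above

target = 5e6                   # want 5 Mbps on the same path
needed = target * rtt / 8      # bytes of window required
print(round(needed / 1024))    # about 122 kbyte, i.e. the "128 k" ballpark above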

That’s the theory. Pick a big window size and go fast. Except:

  1. You don’t get to pick. Even if you control your application, for downloads you can ask for a bigger window size, but you don’t necessarily get it (there’s a sketch of what asking looks like after this list). Probably, you’ll get the smaller of what the applications at either end asked for.
  2. Standard, 1981 edition TCP has the window (buffer) size that can be communicated by the endpoints maxed out at 64k. This isn’t the end of the world; in 1992 Van Jacobson and friends rode to the rescue with RFC 1323, which allows the window size to be scaled to pretty much anything you like. But most TCP stacks come with a default window size in the 64k-ish range, and many applications don’t change it.
  3. Even if both ends of a TCP session ask for and get a large maximum window size, they don’t start with it. TCP congestion control requires that everyone start slowly (it’s called slow start), and this is done by starting with a small window size and increasing it as the acknowledgements flow in and the sending end can get an idea of how much bandwidth there is. So if your application uses lots of short TCP sessions rather than one long one, you’ll never reach your maximum window size and therefore never saturate your connection.

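On point 1, here’s a minimal sketch of what “asking” looks like in practice: an application can request larger socket buffers (which is what the advertised TCP window is drawn from), but the operating system has the final say, and will clamp or adjust the value:

import socket

# Ask the kernel for bigger TCP buffers, then see what we actually got.
# (Linux, for example, typically reports back double the requested value,
# and clamps it to the system-wide sysctl limits.)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)

print("receive buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
print("send buffer:   ", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
sock.close()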
What to do? It depends what you’re trying to achieve. For file transfers, run lots of TCP sessions side by side – can anyone say BitTorrent? Local caching helps for web traffic; move the content closer, and the bandwidth delay product is smaller. Or use a different protocol; I have to say I’ve seen quite a few UDP-based file transfer protocols come and go, because tweaking TCP parameters at both ends is usually a darn sight easier than getting a new protocol right (see don’s law).
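As an illustration of the “lots of TCP sessions side by side” idea, here’s a rough Python sketch of a range-request downloader. The URL, chunk size and worker count are made up, and real tools (download managers, BitTorrent clients) are far more careful about errors and about servers that don’t support range requests:

import concurrent.futures
import urllib.request

URL = "http://www.example.com/big-file.bin"   # hypothetical file
CHUNK = 1 * 1024 * 1024                       # 1 Mbyte per request

def fetch_range(start, end):
    # Each worker opens its own TCP session and fetches one byte range,
    # so no single connection's window limits the overall throughput.
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

# Find the total size, then farm the ranges out to a pool of workers.
head = urllib.request.Request(URL, method="HEAD")
total = int(urllib.request.urlopen(head).headers["Content-Length"])
ranges = [(s, min(s + CHUNK, total) - 1) for s in range(0, total, CHUNK)]

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parts = dict(pool.map(lambda r: fetch_range(*r), ranges))

data = b"".join(parts[s] for s, _ in ranges)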

What it comes down to is that if all you’re going to use your UltraSuperDuperFast broadband connection for is downloading videos from US servers, you’re going to be disappointed. The real key to making this useful is local, or at least locally hosted, content. Preferably located right by the fibre head-ends. It’s a parallel stream to the effort to get the fibre in the ground and get it lit, and it needs to be attended to PDQ.

don’s law is:

“If there’s an unexpected way to implement a widely used protocol or process, someone out there has done so.”

Yes, it’s a lot like the original Murphy’s Law, “If there is any way to do it wrong, he will”, attributed to Edward Murphy, an engineer on the rocket sled tests carried out in the 1950s, who coined it on discovering that all of the accelerometers on a test subject had been wired backwards, requiring a re-run of an expensive test.

don’s law isn’t just a re-statement of Murphy’s Law, or even Sod’s Law (“if it can go wrong, it will”).  It’s a recognition that there are a lot of implementations of stuff out there,  some by people who don’t think the way you do. (Or I do.  Or, in some cases, like any sane individual.) And the way those implementations work, especially under exception conditions, can vary enormously.

… is why I’ve started this blog.

Yes, it’s an ego thing.  A platform to spout the random things that spring to mind, when they’ve sprung.