Archive for the ‘Pushing packets’ Category

Back in the Dark Ages, when dinosaurs ruled the earth … yeah, say the mid 1990s, early ISPs tended to offer “free” email service as part of their connection plans. It was cheap to do; the email usually just took the form of a POP email box, via which you downloaded your email with a client such as Eudora or POPmail for those reprobate MS-DOS users who loved their text-mode clients.

Your email address was usually something like your-dialup-username@the-isp’s-domain-name. There were a bunch of reasons for this.

  1. Email was an early “killer app” for the Internet. Giving the customers an email address got them on and able to do something useful with the ‘net, back when the Web was still in its infancy.
  2. Domain name services were offered as a premium service, and were often expensive in terms of the effort and domain name fees required to provide; corporate customers with permanent connections would usually provide their own email service.
  3. Having the ISP’s domain name in all its customers’ email addresses provided brand recognition.
  4. The requirement to change email address created a disincentive for customers to change provider.

The world has changed a bit since then.

There are a bunch of email providers, like Hotmail, Yahoo! and GMail, which will happily give you an email address, and a very nice web interface, which you can use to get at your email from anywhere. There’s absolutely no need to get yourself tied to a specific provider. (There is one caveat though, and that’s that if you’re not paying for the service, you are not a customer.)

Point 1 no longer applies. You don’t need the ISP’s email service.

There are now many commercial domain hosting services available. Granted, they are not free, but many are cost effective, and they provide good email services, including hosted IMAP service (far superior to the old POP service, which assumed you’d only ever get your email on one computer), server-side filtering, spam removal and so forth, as well as web hosting options. The days of the ISP manually configuring a “virtual domain” onto its web and email servers, and charging a premium price for it, are long gone.

The game of providing email has changed. The service isn’t a case of holding mail in a temporary spool for later download by a single desktop computer. A decent email service stores, and backs up, all email, so that it can be retrieved from multiple desktop, portable and mobile clients. Spam processing is a major drain on resources; many folk don’t understand that it’s war out there – spam is driven by large commercial interests who pay highly organised criminals to spam, and to attack computers to create the means to spam. So not only do you not want to be the target for these gangs; paying a specialist provider to be that target for you is actually cost effective. Automation makes configuring domain names, email and web hosting easy and cheap for suitably organised providers, and domain name registration fees are down to very low prices. For prices in the low hundreds per year, or less, you can have your own domain name, as many email addresses as you need within it, and a smart web host running easy-to-operate software (such as WordPress, which I’m using to write this).

The last two reasons for ISPs providing email are for their benefit, not yours. They get the brand recognition. They get to keep you as a customer, or at least on their customer list, long after their use-by date has passed.

Email has never, ever, been a “free” service; somewhere, somehow, the providers of the service have been making a buck out of it. Maybe it’s in customer retention, maybe it’s in the brand recognition. (It was Telecom Xtra’s explicitly stated goal in its early days to make its email domain a recognised brand.) Maybe it’s in advertising. When you buy that fancy domain / web hosting package with email? Well, the provider has probably spent as much if not more on the email part as on the web hosting part. Which brings me to a simple question. If domain hosting is so cheap, why do I still see ISP email addresses painted on the sides of vans, on billboards and on business cards? The money you spent on that isn’t promoting your business, it’s promoting Telecom’s. Why would you do that?

What is your email worth to you?

What would you do if that email address was no longer available? If you’re no longer a Telecom customer, you’re likely to see your email address axed in the near future, unless you pay them to keep it. If changing your address means reprinting your stationery and repainting signs, and losing email from customers who haven’t noticed that your email address has changed, that’s a high price to pay for a “free” service.

So, c’mon. In NZ, we have a domain registration system that’s the envy of the world (and I’m proud to say I’ve had a bit to do with that). Hosting your email has never been so easy or so cheap, at a time that trying to do it yourself has never been so difficult. How you present yourself to an increasingly digital world is important to how others see you, and whether they want to do business with you.

So once again, what is your email worth to you?

At NZNOG 2012, I presented our work on applying point-to-point semantics to Ethernet-like interfaces, described in my earlier post, Broadcast Interface Addressing Considered Harmful.

The slides are available here.

We’ve done a bit more work on this since the original article. One thing that occurred to us was that if you are prepared to keep making ARP requests for a client, you know whether the link is alive or not. In fact, you can ARP for a host even if you’re not really talking to it.

Consider: We have an IP host, say 192.0.2.57 (the addresses here are documentation placeholders). We tell it that its default gateway is 192.0.2.1. We answer all ARP requests the host makes, except for its own address (see the earlier paper).

But now, instead of one upstream router, we have two. Furthermore, the two routers use 192.0.2.2 and 192.0.2.3 respectively as their local IP addresses, i.e. the addresses they will put in their ARP packets (and in any ICMP packets generated from the interfaces). We still tell the client its default gateway is 192.0.2.1.

The two routers both ARP for the host. Both routers know if they can reach the host. Between them, via a “back channel” (i.e. a protocol running over the backbone), they agree which of them should be the “active” router for that host.

The active router simply behaves as the upstream router described previously. The inactive router does nothing more than make ARP requests for the host and report its availability. This way, if the active router stops participating in the information protocol (i.e. dies), or if it loses contact with the host while the inactive router can still reach it, the inactive router can take over the active role.

As it takes over, it can generate an unsolicited unicast ARP reply to the host, to inform it that the “default” IP address (192.0.2.1 in our example) has changed. Other addresses will sort themselves out depending on the host’s ARP caching strategy. Ideally, the client host will have a fairly rapid ARP time-out and will retry its broadcast ARP for any such addresses.
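As a sketch, the unsolicited unicast ARP reply the newly active router emits could be built like this. This is raw-frame construction only (actually transmitting it would need a raw socket or BPF, which is omitted), and the MAC and IP values are hypothetical placeholders, not anything from our deployment:

```python
import struct

def build_arp_reply(src_mac: bytes, src_ip: bytes,
                    dst_mac: bytes, dst_ip: bytes) -> bytes:
    """Build an Ethernet frame carrying an unsolicited unicast ARP reply.

    The reply tells the client at dst_mac/dst_ip that src_ip (the gateway
    address) is now at src_mac, so the client updates its ARP cache.
    """
    eth_header = dst_mac + src_mac + struct.pack("!H", 0x0806)  # EtherType: ARP
    arp_packet = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware address length
        4,        # protocol address length
        2,        # opcode: reply
        src_mac, src_ip,   # sender: the newly active router
        dst_mac, dst_ip,   # target: the client host
    )
    return eth_header + arp_packet

# e.g. announce that 192.0.2.1 has moved to the new router's MAC:
frame = build_arp_reply(b"\x02" * 6, bytes([192, 0, 2, 1]),
                        b"\x04" * 6, bytes([192, 0, 2, 57]))
```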

This approach has advantages over protocols like VRRP. VRRP works by changing the interface MAC address to a “shared” address, so that IP clients don’t know that there has been a change when the active router swaps over. While that makes for a potentially more rapid fail-over, it comes with a number of disadvantages:

  • The shared MAC address change requires a change to the MAC tables on layer 2 switches;
  • there is some risk of MAC address collisions, especially in Q-in-Q (stacked VLAN) configurations;
  • the VRRP protocol itself is visible (multicast) on the client VLAN.

But the major advantage of this approach is that there is a handshake with the end client. VRRP and similar protocols have no such handshake; they’re fine for detecting and replacing a failed router, but where the failed component is intervening layer 2 infrastructure, VRRP has no way of knowing that the host is not reachable from the active router but is reachable from the inactive one. For example:

  • Switch X connects to Y, and Y to Z
  • Client C connects to switch Y
  • Client D connects to switch X
  • Router A connects to switch X, and is active for clients C & D
  • Router B connects to switch Z, and is inactive for clients C & D

If the link between switches X and Y fails, Router A loses connectivity to Client C. With ARP handshaking, this loss of connectivity is detected and handled by failing over advertisement of Client C’s address to Router B. Furthermore, Client D remains reachable from Router A (and indeed connectivity is lost from Router B), but since each client IP address is processed independently, the active router for that host does not change.
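The per-host election described above can be sketched roughly as follows. This is a minimal model with hypothetical router names and a trivial alphabetical tie-break; the real back-channel protocol is not specified here:

```python
from typing import Dict, Optional

def choose_active(current: Optional[str], reachable: Dict[str, bool]) -> Optional[str]:
    """Pick the active router for one client host.

    `reachable` maps router name -> whether that router's periodic ARP
    for the host is being answered. The current active router keeps the
    role while it can still reach the host; otherwise any router that
    can reach the host takes over. Each client IP is handled independently.
    """
    if current is not None and reachable.get(current, False):
        return current  # no change while the active router still sees the host
    for router in sorted(reachable):  # deterministic tie-break for the sketch
        if reachable[router]:
            return router  # fail over to a router that can still ARP the host
    return None  # host unreachable from everywhere

# Link X-Y fails: Router A loses Client C, Router B takes over for C only.
print(choose_active("A", {"A": False, "B": True}))   # Client C fails over to B
print(choose_active("A", {"A": True, "B": False}))   # Client D stays on A
```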

We believe this is applicable to a number of situations, especially Internet access networks, be they in a data centre or layer-2 metropolitan access networks.

Juha Saarinen dropped me a note a week or two back, asking for an update to my last post for inclusion in NZCS Newsline, in the wake of the IANA IPv4 address pool finally running out and Microsoft’s recently announced successful bid for Nortel Networks’ IP address space.

The published article can be found here, and is different enough from the previous version to warrant re-posting.


The IPocalypse is upon us. There are seven /8 IPv4 address blocks left! Soon there will be six. Then five.

On that fateful day, when the sixth to last /8 block is assigned, the five Regional Internet Registries (RIRs) will receive one each of the remaining five /8s for final allocation. This will probably happen in the next month or two.

Then there will be no more! Oh woe is us!

Or not. There are a bunch of ways that we can measure IP address space usage. They include:

  1. The number of addresses available. Formally, this is 2^32 minus the 588,514,560 addresses (or just over 35 /8 blocks) that are assigned for special uses (multicast, reserved, private addressing, etc.), leaving 3,706,452,736 addresses (or the equivalent of just over 220.9 /8 blocks) available for present or future end-user assignment.
  2. The amount of addresses assigned by IANA to RIRs for allocation. Currently, this stands at pretty much all of the above space, less the aforementioned seven /8s (or 117,440,512 addresses).
  3. The amount of address space allocated by RIRs. According to Geoff Huston, this is likely, at current rates of assignment, to run out in mid-late 2011.
  4. The amount of address space that is actually advertised. Right now, a little under 2/3rds of the allocatable address space (that is, excluding private, multicast and reserved address space) is actually advertised to the global routing table. That’s right, 1/3rd of the IP address space is unequivocally dark.
  5. The amount of address space actually allocated to infrastructure. Now things get murky. Is a /8 advertisement actually representing a /8 worth of allocation? Or is the holder of that /8 advertising it simply because they can?
  6. The amount of address space actually in use. This too is largely unmeasurable. Many advertisements, especially smaller ones, are made to achieve multihoming, in which case a /24 may have very few hosts actually assigned to it. The nature of IP address assignment is that you always have to allocate a larger subnet than you plan to use, unless you can do single-IP-address-per-client allocations, e.g. using PPP & friends, my ARP hack or layer-3 VLAN schemes.

Measurements 1 through 4 are easy. 5 & 6 are hard. All we can say for sure is that each measurement will give a smaller number of addresses in use than the one above it. If an address appears in the global routing table, we can follow it to its associated autonomous system, but beyond that we have to look at individual addresses, and even then an assigned and in-use address may be behind a firewall or something similar, effectively invisible but nonetheless actively in play.

It did occur to me to look at reverse map entries, but experience suggests that these are unhelpful, being fairly universally badly managed.

So, the question of when IP address space will run out remains difficult to answer. Geoff’s IPv4 Address Report shows a curve in address advertisements (fig. 11c) which, although initially exponential, seems to have settled to linear growth of about 176,000,000 addresses per year in actual advertisements since 2006. If that rate is maintained, the 1.3 billion or so unadvertised addresses should run out in about seven years.
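That back-of-envelope estimate checks out:

```python
growth_per_year = 176_000_000           # advertised-address growth since 2006
unadvertised = 1_300_000_000            # roughly the unadvertised pool

years_left = unadvertised / growth_per_year
print(years_left)                       # ≈ 7.4 years at the current rate
```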

But I suspect that as RIR space becomes unavailable, we’ll start to see address space that is currently advertised but not actually in use being re-allocated (read: sold). For starters, there are about 200 million addresses tied up in non-carrier addresses that are currently advertised as /8s. Admittedly, a goodly chunk of that space may actually be in use, but one suspects that a significant proportion isn’t. There are a lot of equally historical /16 assignments and smaller blocks assigned under multihoming policies that are similarly underutilised, and could shed a large proportion of their advertised allocation as their holders discover it’s worth more to them in someone else’s hands than in their own.

So I’m going to lick my finger and stick it in the wind. I think we have ten years or so before we really, genuinely run out of IPv4 addresses, and that ignores the transition to IPv6 completely. In reality, as IPv4 addresses become scarce (read: expensive), we’ll see folks making do with less and looking harder at IPv6 transition, so I doubt we’ll ever actually run out. Sure, there’s a whole bunch of stuff you can’t do without lots of addresses, but those applications will simply have to go to IPv6.

Don’t get me wrong; I’m not suggesting for a moment that we don’t have to worry. The single thing that will prevent exhaustion is money. Scarce resources have value; the more scarcity, the more value. RIRs have some really hard choices ahead of them; they’re going to be in the firing line to manage the emerging market in IPv4 address space. Pretending that organisations don’t “own” their address space will stop being an option; the court cases haven’t started in earnest yet, but unless the RIRs urgently wake from the fantasy that IP address space is not a tradable asset, they will.

Either they will rise to the challenge, or they’ll be swept into irrelevancy. I rather hope the latter doesn’t happen, because the alternative is anarchy. The best we can hope for is that enough wiser heads prevail to ensure that the emerging IP address bourses have sufficient support to ensure that the fabric of the Internet isn’t torn apart by the conflict between those who long for a non-commercial Internet where everyone plays nice, and the immediate needs of a market where folks need to get stuff done.

I hate IPv4 link broadcast interface (e.g. Ethernet) addressing semantics.  To recap, if I have two boxes on each end of a point-to-point link (say between a gateway and an end host), we address as follows (using the documentation subnet 192.0.2.0/30 as an example):

  • 192.0.2.0: network address (reserved)
  • 192.0.2.1: Host 1 (gateway)
  • 192.0.2.2: Host 2 (end host)
  • 192.0.2.3: broadcast address.

That’s four IP addresses, for a link to a single host.  Hello?  Haven’t you heard the news?  IP addresses are running out!

Some folks manage to get away with using /31 masks, e.g.

  • 192.0.2.0: Host 1 (gateway)
  • 192.0.2.1: Host 2 (end host)

which is just wrong.  Better in terms of address usage (two addresses instead of four), but still just plain wrong. And you’re still wasting addresses.
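Python’s ipaddress module makes the per-host address cost easy to tally (192.0.2.0 is the RFC 5737 documentation prefix, used purely as a placeholder):

```python
import ipaddress

def addresses_consumed(prefixlen: int) -> int:
    """Total addresses a subnet of this size burns to connect one host."""
    net = ipaddress.ip_network(f"192.0.2.0/{prefixlen}")
    return net.num_addresses

print(addresses_consumed(30))  # 4: network, gateway, host, broadcast
print(addresses_consumed(31))  # 2: gateway and host, no network/broadcast
# PPP-style point-to-point: 1 address. Only the remote host's /32 goes in
# the routing table; the local address is shared and need not be in-subnet.
```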

The PPP folks a long time ago figured that a session, particularly in client to concentrator type configurations, only needs one IP address. A “point to point” interface has a local address, and a remote address, of which only the remote address needs to be stuffed in the routing table.  The local address can be the address of the concentrator, and doesn’t even need to be in the same subnet.

So why can’t my Ethernet interfaces work the same way?

A point to point link really doesn’t have broadcast semantics.  Apart from stuff like DHCP, you never really need to broadcast — after all, our PPP friends don’t see a need for a “broadcast” address.

Well, we decided we had to do something about this.  The weapon of choice is NetGraph on FreeBSD.  NetGraph basically provides a bunch of kernel modules that can be linked together.  It’s been described as “network Lego”.  I like it because it’s easy to slip new kernel modules into the network stack in a surprising number of places. This isn’t a NetGraph post, so I won’t spend more verbiage on it, but it’s way cool. Google it.

In a real point-to-point interface, both ends of the link know the semantics of the link.  For Ethernet point-to-point addressing, we can still do this (and my code happily supports this configuration), but obviously both ends have to agree to do so. “Normal” clients won’t know what we’re up to, so we have to do this in such a way that we don’t upset their assumptions.

So we cheat. And we lie. And worst of all, we do proxy ARP!

What we do is tell our clients that they are on a /24 network (say 192.0.2.0/24). Their IP address is, for example, 192.0.2.57, and the gateway is 192.0.2.1. Any time we get a packet for 192.0.2.57, we’ll send it out that interface, doing ARP as normal to resolve the remote host’s MAC address.

Going the other way, we answer ARP requests for any IP address in 192.0.2.0/24, except 192.0.2.57, with our own MAC address.  That means that if they ARP for 192.0.2.58, we’ll answer the ARP request, which directs that packet to us, where we can use our interior routes to route it correctly.  In our world, two “adjacent” IP addresses could be on opposite sides of the network, or on different VLANs on the same interface.
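As a rough sketch of that proxy-ARP decision (the addresses are RFC 5737 documentation placeholders standing in for the real deployment’s, and the real module of course does this per point-to-point interface in the kernel):

```python
import ipaddress

SUBNET = ipaddress.ip_network("192.0.2.0/24")        # what the client believes
ASSIGNED = {ipaddress.ip_address("192.0.2.57")}      # clients on p2p interfaces

def answer_arp(target: str) -> bool:
    """Answer (with our own MAC) every ARP request for the /24 except the
    addresses actually assigned to clients behind point-to-point interfaces;
    those must resolve to the real hosts."""
    addr = ipaddress.ip_address(target)
    return addr in SUBNET and addr not in ASSIGNED

print(answer_arp("192.0.2.58"))   # True: proxy the ARP, route it internally
print(answer_arp("192.0.2.57"))   # False: that's the client's own address
print(answer_arp("198.51.100.1")) # False: not on this subnet at all
```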

The result is one IP address per customer.  We “waste” three addresses per 256: the network (.0), gateway (.1) and broadcast (.255), and we have to be a bit careful about what we do with the .1 address, since it could appear on every router that is playing with that /24.  But we can give a user a single IP address, and put it anywhere in the network.

We can actually have multiple IP addresses on the same interface; we do this by having the NetGraph module have a single Ethernet interface but multiple virtual point-to-point interfaces.  So if we want to give someone two IP addresses, we can do that as two, not necessarily adjacent, /32 addresses.  We don’t answer ARPs for any of the assigned addresses, but do answer everything else. The module maintains a mapping of point-to-point interface to associated MAC address.

The following is a technique I’ve used over the last decade or so for distributing web traffic (or potentially any service) across multiple servers, using just DNS.  Being an old DNS hack, I’ve called this technique Poor Man’s Anycast, although it doesn’t really use anycasting.

But before we get into the technique, we need to make a brief diversion into a little-known but rather neat feature of the DNS, or more accurately, of DNS forwarders, which makes this a cool way to do stuff. The feature is name server selection.

Most DNS clients, and by this I include your home PC, make use of a DNS forwarder.  The forwarder is the thing that handles (and caches) DNS requests from end clients, while a DNS server carries authoritative information about a limited set of domains and only answers queries for them. These two functions have historically been conflated rather severely, mainly due to the use of BIND for both, and why this is a bad thing is the subject for a whole other post.

Moving right along. A DNS forwarder gets to handle lots of queries for any domain that its clients ask for. When you ask for, say, www.example.com, it asks one of the root servers for that full domain name (let’s assume it’s just come up and doesn’t have anything cached).  It gets back a delegation from the root servers, saying basically, “I don’t know, but the GTLD (.com, .net) servers will”, and telling you where to find the GTLD servers.  You ask one of the GTLD servers, and get back an answer that says that they don’t know either, but the domain’s own name servers do.

You then ask one of those name servers, and hopefully you’ll get the answer you want (e.g. the IP address).

Now, along the way, the forwarder has been caching everything it got. Every time it asks a name server for data, it stores the time it took to reply. That means that when looking up names in a domain, the forwarder has been collecting timing and reliability data, which it uses to choose which name server to ask next time, as well as the answers it received.  So if ns1.example.com answers in 20 ms, but ns2.example.com answers in 10 ms, roughly two thirds of the queries for the domain will be sent to ns2.example.com. If the timing difference is much greater, the split of queries will be even more marked. Similarly, if a name server fails to respond at all, that fact will be reflected in the accumulated preference assigned to that server, and it will get very few queries in future; just enough so that we know we can start sending it queries again when it comes back.
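A toy simulation shows the effect. The selection rule here, picking a server with probability inversely proportional to its measured round-trip time, is a rough stand-in for what real forwarders such as BIND actually do, and the hostnames are placeholders:

```python
import random
from typing import Dict

def pick_server(srtt_ms: Dict[str, float]) -> str:
    """Choose a name server with probability roughly inversely
    proportional to its smoothed round-trip time."""
    weights = {ns: 1.0 / rtt for ns, rtt in srtt_ms.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    for ns, w in weights.items():
        r -= w
        if r <= 0:
            return ns
    return ns  # floating-point fallback: last server

random.seed(1)
counts = {"ns1.example.com": 0, "ns2.example.com": 0}
for _ in range(10_000):
    counts[pick_server({"ns1.example.com": 10, "ns2.example.com": 20})] += 1
print(counts)  # the 10 ms server ends up with roughly two thirds of queries
```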

This is a powerful effect, and is of particular use when distributing servers over a wide geographical area. DNS specialists know about it, because poor DNS performance affects everything, and DNS people don’t like adversely affecting everything. (They’re really quite paranoid about it. Trust me, I’m one.) But it can also be used to pick the closest server for other things as well.

After all, closeness (in terms of round-trip time) is very important in network performance (see my post on bandwidth delay products).

The technique is as follows.  Let’s say we have three web servers, carrying static content. Call them, say, www1.example.com, www2.example.com and www3.example.com. Let’s say that they’re geographically widely disparate. All three servers carry content for www.example.com.

So, we start by configuring, on the example.com name servers:

$TTL 86400
www     IN      NS      www1.example.com.
        IN      NS      www2.example.com.
        IN      NS      www3.example.com.

We then run a DNS server on all three web servers.  We configure the servers with a zone for www.example.com along the lines of:

$TTL 86400                       ; Long (24 hour) TTL on NS records etc.
@       IN      SOA     www1.example.com. hostmaster.example.com. (
                                2009112900 3600 900 3600000 300 )
        IN      NS      www1.example.com.
        IN      NS      www2.example.com.
        IN      NS      www3.example.com.
$TTL 300                         ; Short (five minute) TTL on A record
@       IN      A       192.0.2.1 ; Set this to the host's own IP address

Now the key is that each web server serves up its own IP address. When a DNS forwarder makes a query for www.example.com, it will be directed to one of www1, www2 or www3.example.com. But as more and more queries get made, one of those three will start handling the bulk of the queries, at least if that one is significantly closer than the other two. And if www1.example.com gets the query, it answers with its own IP address, meaning that it also gets the subsequent HTTP request or other services directed to it. The short DNS TTL (five minutes in the example) means that the address gets queried moderately often, allowing the name server selection to “get up to speed”. Much longer TTLs on the name server records mean that data doesn’t get forgotten too quickly.

The result is that in many cases, the best server gets the request.

The technique works best if there are lots of domains being handled by the same set of servers, and there are lots of requests coming through. That way the preferences get set quickly in the major ISPs’ DNS forwarders. The down side of the technique is that far away servers will still get some queries. This non-determinism may be a reason for not deploying this technique.  If you want determinism, you’ll need to look at more industrial grade techniques.

Now, this isn’t what players like Akamai do, and it isn’t what anycasting is about. Akamai and (some) other content distribution networks work by maintaining a map of the Internet, and returning DNS answers based on the requester’s IP address. But this is a fairly heavyweight answer to the problem. It’s not something you can implement with just BIND alone.

Anycasting on the other hand relies on advertising the same IP address in multiple places, and letting BGP sort out the nearest path. This has three disadvantages:

  1. It potentially breaks TCP. If there are equal cost paths to different anycast nodes, it’s possible one packet from a stream might go one way, while the next packet might be sent to a completely different host (at the same IP address). In practice, this has proven to be less of a problem than might be expected, but there is still scope for surprises.
  2. Each of your nodes has to be separately BGP peered with its upstream network(s). That’s a lot more administration than many ISPs will do for free.
  3. Most importantly, being close in BGP terms is not the same as being close physically or in terms of round-trip time. Many providers have huge reach within a single AS, so a short AS-path (the main metric for BGP) may actually be a geographically long distance, with a correspondingly long round-trip.

The other nice thing about poor man’s anycast is that it’s dynamic; if a node falls off the world, as long as its DNS goes away too, it’ll just disappear from the cloud as soon as the TTLs time out. If a path to it gets congested, name server selection will notice the increased round-trip time and de-prefer that server.

And of course you don’t need to be a DNS or BGP guru, or buy/build expensive, complex software systems to set it up.

I found myself explaining this one at Curry tonight, in the context of discussing fast broadband.

Basically, if you have a reliable stream protocol like, to take a random example, TCP, and you’re not doing anything imaginative with it, you run into the following problem:

Every byte you send might need to be resent if it gets lost along the way.  So, you buffer whatever you send up until you get an acknowledgement from the other end.  Let’s say, for argument’s sake you use a 64k buffer. We call this buffer the window, and the size is the window size.

Now, let’s say you have a looooonnnngggg path between you and your remote endpoint. Let’s say it’s 200 milliseconds, or 1/5th of a second. This is pretty reasonable for an NZ-US connection — the speed of light is not our friend.

And finally, for simplicity’s sake, let’s say that the actual bandwidth over that path is Very High, so serialisation delays (the time taken to put one bit after the next) are negligible.

So,  if I send 64k bytes (or 512 k bits) worth of data, it takes 200 ms before I get an acknowledgement. It doesn’t matter how fast I send my 64k; I still have to stop when I’ve sent it.  200 ms later, I get a bunch of acknowledgements back, for the whole 64k (assuming nothing got dropped), and I can now send my next 64k.

So the actual throughput, through my SuperDuperBroadband connection, is 64k bytes per 200 ms, or 2.5 Mbps.

To turn this around, if I want 2.5 Mbps at 200 ms, I need a 64k byte window; if I want 5 Mbps on a 200 ms path, I’m going to need to up the window size to 5 Mbps times 200 ms = 128 k bytes.

That window size calculation is the bandwidth delay product.
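The calculation is simple enough to express directly (using k = 1000, as in the 64 k bytes = 512 k bits example above):

```python
def window_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the window needed to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8  # bits -> bytes

# A 64 kB window on a 200 ms path caps throughput at about 2.5 Mbps:
print(64_000 * 8 / 0.2)              # 2,560,000 bps ≈ 2.5 Mbps
# And for 5 Mbps at 200 ms you need roughly double the window:
print(window_bytes(5_000_000, 0.2))  # 125,000 bytes ≈ 128 kB
```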

That’s the theory. Pick a big window size and go fast. Except:

  1. You don’t get to pick. Even if you control your application, for downloads you can ask for a bigger window size, but you don’t necessarily get it. Probably, you’ll get the smaller of what the applications at either end asked for.
  2. Standard, 1981-edition TCP has the window (buffer) size that can be communicated by the endpoints maxed out at 64k. This isn’t the end of the world; in 1992 Van Jacobson and friends rode to the rescue with RFC 1323, which allows the window size to be scaled to pretty much anything you like. But most TCP stacks come with a default window size in the 64k-ish range, and many applications don’t change it.
  3. Even if both ends of a TCP session ask for and get a large maximum window size, they don’t start with it. TCP congestion control requires that everyone start slowly (it’s called slow start), and this is done by starting with a small window size and increasing it as the acknowledgements flow in and the sending end can get an idea of how much bandwidth there is.  So if your application uses lots of short TCP sessions rather than one long one, you’ll never reach your maximum window size and therefore never saturate your connection.
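Point 3 can be sketched numerically. This assumes the classic double-per-RTT slow start with a hypothetical two-segment initial window; real stacks vary in both respects:

```python
def slow_start_rtts(target_window: int, mss: int = 1460,
                    initial_segments: int = 2) -> int:
    """Round trips for the congestion window to grow from its initial
    value to target_window, doubling each RTT (no losses). Shows why
    short TCP sessions never reach a big window."""
    cwnd = initial_segments * mss
    rtts = 0
    while cwnd < target_window:
        cwnd *= 2
        rtts += 1
    return rtts

print(slow_start_rtts(64_000))     # 5 round trips to reach a 64 kB window
print(slow_start_rtts(1_000_000))  # 9 round trips for a 1 MB window
# On a 200 ms path, that's one to two seconds spent just ramping up.
```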

What to do? It depends what you’re trying to achieve.  For file transfers, run lots of TCP sessions side by side – can anyone say BitTorrent? Local caching helps for web traffic; move the content closer, and the bandwidth delay product is less. Or use a different protocol. I have to say I’ve seen quite a few UDP-based file transfer protocols come and go, because tweaking TCP parameters at both ends is usually a darn sight easier than getting a new protocol right (see Don’s law).

What it comes down to, is that if all you’re going to use your UltraSuperDuperFast broadband connection for is downloading videos from US servers, you’re going to be disappointed. The real key to making this useful is local, or at least, locally hosted, content. Preferably located right by the fibre head-ends. It’s a parallel stream to the effort to get the fibre in the ground and get it lit, and it needs to be attended to PDQ.