Archive for January, 2010

I hate IPv4 addressing semantics on broadcast interfaces (e.g. Ethernet).  To recap: if I have a box on each end of a point-to-point link (say, between a gateway and an end host), we address them as follows (a /30, for example):

  • 10.1.1.0: Network address (reserved)
  • 10.1.1.1: Host 1 (gateway)
  • 10.1.1.2: Host 2 (end host)
  • 10.1.1.3: Broadcast address (reserved)

That’s four IP addresses, for a link to a single host.  Hello?  Haven’t you heard the news?  IP addresses are running out!
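
If you want the damage spelled out, Python’s ipaddress module will do it for you (purely an illustration, using the addresses from the list above):

    # A /30 point-to-point link: four addresses consumed to connect one host.
    from ipaddress import ip_network

    link = ip_network("10.1.1.0/30")
    print(link.network_address)      # 10.1.1.0 -- reserved as the network address
    print(list(link.hosts()))        # [IPv4Address('10.1.1.1'), IPv4Address('10.1.1.2')] -- gateway and end host
    print(link.broadcast_address)    # 10.1.1.3 -- reserved as the broadcast address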

Some folks manage to get away with using /31 masks, e.g.

  • 10.1.1.4: Host 1 (gateway)
  • 10.1.1.5: Host 2 (end host)

which is just wrong.  Better in terms of address usage (two addresses instead of four), but still just plain wrong.  And you’re still wasting addresses.
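
For what it’s worth, the /31 arrangement (RFC 3021) really is just the two addresses, with no separate network or broadcast address to reserve.  Again in Python, purely as illustration:

    from ipaddress import ip_network

    link = ip_network("10.1.1.4/31")
    print(list(link))    # [IPv4Address('10.1.1.4'), IPv4Address('10.1.1.5')] -- nothing reserved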

The PPP folks figured out a long time ago that a session, particularly in client-to-concentrator configurations, only needs one IP address.  A “point to point” interface has a local address and a remote address, of which only the remote address needs to be stuffed into the routing table.  The local address can be the address of the concentrator, and doesn’t even need to be in the same subnet.
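
Here’s a toy model of that idea (not actual PPP or FreeBSD code; the interface names and addresses are made up): the only thing that has to land in the routing table is the remote /32, and the local address is whatever the concentrator already has.

    from dataclasses import dataclass
    from ipaddress import ip_network

    @dataclass
    class PtpSession:
        ifname: str   # hypothetical interface name
        local: str    # the concentrator's own address; need not share a subnet with the client
        remote: str   # the only address that must be routed

    sessions = [
        PtpSession("ptp0", local="192.0.2.1", remote="10.1.2.5"),
        PtpSession("ptp1", local="192.0.2.1", remote="10.7.9.23"),
    ]

    # One /32 host route per session; no per-link subnet anywhere.
    routes = {ip_network(s.remote + "/32"): s.ifname for s in sessions}
    print(routes)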

So why can’t my Ethernet interfaces work the same way?

A point to point link really doesn’t have broadcast semantics.  Apart from stuff like DHCP, you never really need to broadcast — after all, our PPP friends don’t see a need for a “broadcast” address.

Well, we decided we had to do something about this.  The weapon of choice is NetGraph on FreeBSD.  NetGraph basically provides a bunch of kernel modules that can be linked together.  It’s been described as “network Lego”.  I like it because it’s easy to slip new kernel modules into the network stack in a surprising number of places.  This isn’t a NetGraph post, so I won’t spend more verbiage on it, but it’s way cool.  Google it.

In a real point-to-point interface, both ends of the link know the semantics of the link.  For Ethernet point-to-point addressing, we can still do this (and my code happily supports this configuration), but obviously both ends have to agree to do so. “Normal” clients won’t know what we’re up to, so we have to do this in such a way that we don’t upset their assumptions.

So we cheat.  And we lie.  And worst of all, we do proxy ARP!

What we do is tell our clients that they are on a /24 network. Their IP address is, for example, 10.1.2.5/24, and the gateway is 10.1.2.1. Any time we get a packet for 10.1.2.5, we’ll send it out that interface, doing ARP as normal to resolve the remote host’s MAC address.

Going the other way, we answer ARP requests for any IP address in 10.1.2.0/24, except 10.1.2.5, with our own MAC address.  That means that if they ARP for 10.1.2.6, we’ll answer the ARP request, which directs that packet to us, where we can use our interior routes to route it correctly.  In our world, two “adjacent” IP addresses could be on opposite sides of the network, or on different VLANs on the same interface.
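
In code-shaped terms, the two directions look something like this (a toy model of the decision logic, not the actual NetGraph node; the MAC address is made up):

    from ipaddress import ip_address, ip_network

    SUBNET  = ip_network("10.1.2.0/24")
    CLIENT  = ip_address("10.1.2.5")
    OUR_MAC = "02:00:00:00:00:01"   # illustrative locally administered MAC

    def route_towards_client(dst):
        """Traffic heading for the client from the rest of the network."""
        if ip_address(dst) == CLIENT:
            return "send out the client-facing interface, ARPing for the client as usual"
        return "hand to the interior routes"

    def answer_arp(target):
        """ARP requests arriving from the client on its interface."""
        target = ip_address(target)
        if target in SUBNET and target != CLIENT:
            return OUR_MAC    # proxy ARP: the packet comes to us, and we route it properly
        return None           # never answer for the client's own address

    print(answer_arp("10.1.2.6"))   # our MAC, even if .6 lives on the far side of the network
    print(answer_arp("10.1.2.5"))   # None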

The result is one IP address per customer.  We “waste” three addresses per 256: the network (.0), the gateway (.1) and the broadcast (.255), and we have to be a bit careful about what we do with the .1 address — it could appear on every router that is playing with that /24.  But we can give a user a single IP address, and put it anywhere in the network.

We can actually have multiple IP addresses on the same interface; we do this by having the NetGraph module present a single Ethernet interface but multiple virtual point-to-point interfaces.  So if we want to give someone two IP addresses, we can do that as two, not necessarily adjacent, /32 addresses.  We don’t answer ARPs for any of the assigned addresses, but do answer everything else.  The module maintains a mapping from each point-to-point interface to its associated MAC address.
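
The bookkeeping behind that is roughly a table like this (names and values are illustrative, not the module’s real internals): each assigned /32 maps to a virtual point-to-point interface and its client’s MAC, and the ARP rule becomes “answer for anything in the /24 that isn’t in the table”.

    assigned = {
        # client address -> (virtual point-to-point interface, client MAC); all values made up
        "10.1.2.5":  ("ptp0", "00:11:22:33:44:55"),
        "10.1.2.77": ("ptp0", "00:11:22:33:44:55"),   # same customer, a second non-adjacent /32
        "10.1.2.9":  ("ptp1", "66:77:88:99:aa:bb"),
    }

    def deliver(dst_ip):
        # Packets for an assigned address go straight out that client's interface;
        # anything else in the /24 gets proxy-ARPed and handed to the interior routes.
        if dst_ip in assigned:
            ptp_if, mac = assigned[dst_ip]
            return f"out {ptp_if} to {mac}"
        return "answer the ARP ourselves and route internally"

    print(deliver("10.1.2.77"))
    print(deliver("10.1.2.6"))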

Don’t shout at your disk drives.  Seriously.  They don’t like it.  They sulk.

Brendan Gregg, of the Sun Microsystems Fishworks engineering team, has written up this effect, with video, at http://blogs.sun.com/brendan/entry/unusual_disk_latency

Moreover, don’t vibrate your drives.  Why am I saying this?

Because three months ago we took delivery of three 1U pizza boxes.  They’re small Supermicro boxes, with room for a normal ATX motherboard and a hard drive.  We equipped these with terabyte drives, fairly normal Supermicro motherboards, 3 GHz Core 2 Duo CPUs and 8 GB of memory each.

They just didn’t run right.  Occasionally, one wouldn’t even make it through an OS install, and the ones that did wouldn’t get through as much work as a much lower-spec machine.

We suspected the drives; we suspected the power supply.  Actually, we really thought it was the power supply, but even though the PSUs in these chassis were small, and the 12V rail seemed to be running slightly low at 11.85V, no amount of bashing the numbers suggested that the systems were actually underpowered.

The first breakthrough was running “hdparm -t --direct /dev/sda” on the drive, which showed wildly fluctuating numbers, consistent with the behaviour we were seeing.  So it was something to do with the disk subsystem.
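
For the curious, this is roughly how we watched it (a quick sketch rather than our exact script; it assumes hdparm’s usual “… = NNN.NN MB/sec” summary line, needs root, and /dev/sda is just an example):

    import re
    import subprocess

    def hdparm_mb_per_sec(dev="/dev/sda"):
        out = subprocess.run(["hdparm", "-t", "--direct", dev],
                             capture_output=True, text=True, check=True).stdout
        m = re.search(r"=\s*([\d.]+)\s*MB/sec", out)
        return float(m.group(1)) if m else None

    # A healthy drive gives steady numbers; the sick ones were all over the place.
    print([hdparm_mb_per_sec() for _ in range(10)])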

The next breakthrough was when we discovered that if we unplugged the chassis fan (an ugly centrifugal thing) from the motherboard, the problem went away.  The hdparm numbers stabilised at 100MB/s or more.

We saw small changes in the power supply voltage when we did this, so we were still suspecting the power supply.  I put an ammeter on the fan power line to see how much current the fan was drawing: 1.2A at full speed.

We played with the fan speed in the BIOS; at its lowest speed, it would pull 0.25A, and the drive would perform well; at the “server” setting, with the server otherwise unloaded, it would pull about 0.6A.  At that rate, it was starting to have an effect on performance.

This was a PSU that was supposed to be able to deliver 18A on the 12V rail, and 260W total.  I really couldn’t see how the 12V would be at the edge when the PSU was pulling less than 100W (measured at the AC feed) and was running three fans and a hard drive and a few minor bits and pieces like the serial port and network interface, all of which should have summed to maybe 5A.  The numbers didn’t add up.
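
Back-of-the-envelope, the budget on the 12V rail looked something like this (the chassis fan figure is measured; the rest are rough guesses on my part, which is exactly why it made no sense):

    rail_rating_a = 18.0       # what the PSU claims on the 12V rail
    chassis_fan_a = 1.2        # measured, fan at full speed
    other_fans_a  = 2 * 0.3    # CPU and PSU fans -- assumed
    drive_a       = 1.0        # a 3.5" drive's 12V draw -- assumed
    misc_a        = 0.5        # serial port, NIC, odds and ends -- assumed

    total_a = chassis_fan_a + other_fans_a + drive_a + misc_a
    print(f"roughly {total_a:.1f} A of an {rail_rating_a:.0f} A rail")   # nowhere near the limit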

Finally, I had a brainwave.  I removed the fan from the chassis, still running.  The problem went away.  I touched the fan to the drive.  The drive throughput dropped through the floor.

After a few more experiments, the conclusion is that with the fan mounted close to the drive, the vibrations were enough to upset the performance of the drive, consistently.  Two different terabyte drives (one Seagate, one Western Digital) exhibited the same problem.

I duplicated this by applying abnormal vibration to the case of my desktop PC (a half-terabyte Seagate), and even the grotty little thing I have at home (a 160GB Seagate PATA drive).

Conclusion: all modern drives are subject to potentially serious performance issues when faced with abnormal vibration.  The Supermicro chassis exacerbated the problem because of the placement of the fan with respect to the drive, and the fact that the drive is mounted directly to the chassis.  Also, the placement of cables up against the fan meant that vibration was being transferred directly through the connectors from the fan; something that could be partially alleviated by re-routing the power cable under the fan.

The fact that right angle SATA power connectors are so darned hard to get made this more of an issue than it should have been.

I think a bit of judicious use of closed-cell foam packing, turning the fan speed down, and re-routing cables away from the fan will finally solve the problem.

Hopefully.