2012-02-17

The datacentre is the new laptop


Sepr on CK1 at PRSC

HP has just announced its forthcoming Gen 8 servers. Rather than go on about the usual stuff: CPUs, I/O bandwidth etc., or even the trend to hang solid state storage off the PCIe bus, what's interesting to me is this: the servers are explicitly designed to be part of a larger system, a datacentre.

The existing products are individual servers that you just happen to put into racks and just happen to hook up to a switch stuck at the top of the same rack. The racks may be laid out hot aisle/cold aisle, but that's mostly a deployment detail the servers don't care about, beyond ensuring the airflow is good.

This has now changed.

The Datacenter as a Computer argued that software developers need to recognise that the datacentre is the new execution platform: one with mixed availability, limited bandwidth and other concerns that could be ignored before -or at least treated as the special case of "distributed systems", rather than what we have now: "systems". Everything is distributed.

These hardware changes mirror that. Here are some of the new concerns for both the ops team -and the applications themselves.
  1. Re-integration of storage and computation.
  2. Availability through replication: less RAID-style hardware, more replication across machines.
  3. Inventory tracking -especially for identifying failure points, such as monitoring the history of specific batches of disks. If some appear particularly unreliable, you want to find all of them.
  4. Networking: 10 GbE is still a luxury; bonded 2x1 GbE is good for availability too. Understanding network failures in data centers shows why ToR switches become the dominant network failure point in a cluster -and from a re-replication perspective, that's not ideal.
  5. Power management. Beyond PUE, the metric of datacentre overhead, power consumption inside the servers is a big concern.

The new servers then, are designed to live in this world.

Inventory. They work out from the rack (don't ask me how, I don't know these things) where they are in it -information that can be propagated to the management tools and used for inventory tracking.

Networking. Lots of Ethernet ports: some slow and inexpensive for management, faster ones for the applications.

Power. This is work you can point to Chandrakant Patel at HP Labs for. If you look at his published work, you can see a lot of it is about airflow and cooling in a datacentre. If you can improve that -as container-hosted datacentre pods can- then your PUE is better. Why instrument the inside of the servers? Because it lets you keep the hardware within its limits: you have a better idea of what is going on inside. Every extra degree F, C or K you can take the air up saves a lot of money over time, yet the risk of overheating -and the cost of doing so- makes it dangerous. Knowing what is happening inside the servers gives you the confidence to run closer to those limits.

This is what the new servers enable. We are going from servers that you stack to servers designed to locate themselves in the racks, ideally hosted within a datacentre container that is optimised for airflow and run as close to the temperature limits as is considered safe, based on the information coming out of the servers themselves.

Which is very close to what a laptop does: a box with optimised airflow and fans that come on when they feel it is important, and with a power budget that the system is designed to optimise. The datacentre is the new laptop, at least from a power and cooling perspective.


Now, what about the software? If the datacentre-level application infrastructure can get at the power, topology and network information, it could adapt itself better.

The topology information that the servers can determine could be used to dynamically generate the topology map for the cluster. It is entirely coincidental that I'm typing this while my new topology patches are being tested in an adjacent console, but those changes (better support for topology sources other than the script runner, the ability to dump the current topology) are effectively a precursor. I wouldn't do some fancy integrated Java module though -better to have a topology source that just reads a Java properties file and, by polling for changes, reacts to servers being moved: something like the sketch below. Let the management tooling generate that file and it would propagate into HDFS and the RM/MR layer.
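
To make that concrete, here is a rough sketch -not the patch under test, just the idea- of a topology source that reads host-to-rack assignments from a properties file and rereads the file whenever it changes. It assumes Hadoop's DNSToSwitchMapping plugin interface as it stood in branch-1; the file path is illustrative and the error handling is minimal.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import org.apache.hadoop.net.DNSToSwitchMapping;

    /** Sketch: map hostnames to racks from a properties file of host=/rack entries. */
    public class PropertiesFileTopology implements DNSToSwitchMapping {

      /** Illustrative path; the management tooling would own and rewrite this file. */
      private final File source = new File("/etc/hadoop/topology.properties");
      private long lastLoaded = 0;
      private Properties hostToRack = new Properties();

      /** Reload the file if it has changed since the last load; keep the old map on failure. */
      private synchronized void maybeReload() {
        long modified = source.lastModified();
        if (modified > lastLoaded) {
          Properties fresh = new Properties();
          InputStream in = null;
          try {
            in = new FileInputStream(source);
            fresh.load(in);
            hostToRack = fresh;
            lastLoaded = modified;
          } catch (IOException swallowed) {
            // file missing or mid-rewrite: stick with the previous topology
          } finally {
            if (in != null) {
              try { in.close(); } catch (IOException ignored) { }
            }
          }
        }
      }

      @Override
      public synchronized List<String> resolve(List<String> names) {
        maybeReload();
        List<String> racks = new ArrayList<String>(names.size());
        for (String host : names) {
          racks.add(hostToRack.getProperty(host, "/default-rack"));
        }
        return racks;
      }
    }

Point the topology.node.switch.mapping.impl plugin setting at something like that, have the management tools rewrite the file as servers report their positions, and the rack map looks after itself.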

Power? If overheating is a problem, a server can be clocked back, which makes it slower. It may be better to actually tell the resource manager that there are fewer slots on that box, reducing its actual workload; that way the work running in the remaining slots doesn't take longer than normal to complete.
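
Purely as a back-of-the-envelope illustration -the threshold and the scaling are invented, and the temperature reading would have to come from the server's own sensors via the management tooling- the slot calculation could be as dumb as this:

    /** Toy sketch: derive a task-slot count for one server from its inlet temperature. */
    public class ThermalSlotGovernor {

      /** Invented threshold: above this, start shedding slots. */
      private static final double THROTTLE_TEMP_C = 35.0;

      private final int normalSlots;

      public ThermalSlotGovernor(int normalSlots) {
        this.normalSlots = normalSlots;
      }

      /** The hotter the box, the fewer slots it advertises; never fewer than one. */
      public int slotsFor(double inletTempC) {
        if (inletTempC <= THROTTLE_TEMP_C) {
          return normalSlots;
        }
        double scale = 1.0 - (inletTempC - THROTTLE_TEMP_C) / 10.0;
        return Math.max(1, (int) Math.round(normalSlots * scale));
      }
    }

Feed the result to whatever the scheduler accepts as a per-node capacity and an overheating box sheds work rather than running everything slowly.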

Networking? We really need a way to get more information about the network backplane into the application -including the amount of bandwidth currently allocated to applications. Bandwidth can be a precious resource, but right now there is better tooling to manage it in a BitTorrent client than there is between applications in a datacentre.

This is a challenge and an opportunity. A challenge: this information needs to be extracted and forwarded to the applications -which then need to act on it. An opportunity: it will make the applications and the datacentres work better. Wave goodbye to writing topology scripts that don't work; say hello to being able to move servers around and have the application infrastructure work out where they are. Worry less about uncontrolled backbone bandwidth use in a shared datacentre; have some policy tooling to manage it across applications. As for power, hope to see the electricity bills decrease.


[Artwork: Sepr on Jamaica Street, Stokes Croft]

2012-02-05

Just because you can rewrite your codebase doesn't mean you should have to



3DOM on Richmond Road: Remember the future

The strength of automated test suites is that you can verify that all the testable/covered parts of your system still work, even after major changes.

The strength of modern IDEs is that at the click of a mouse you can find wherever a class or method is used, then go to those places and edit them.

Even so, availability should not imply necessity: life is best when you don't have to use these features unless you really, really want to. Everything should just work.

And usually it does. But not last week. A small bug surfaced on Tuesday: you got an error marshalling strings with square brackets round them, "[]". At first I tried the obvious tactic: denying that this could be happening, but after repeated evidence to the contrary I sat down and had a look.

It turns out that the library, json-lib, which presents the same interface as the Java Map classes, has an extra "feature": whenever you add a string attribute, that string value gets parsed as JSON if it starts with "{" or "[". Which means that parentnode.put("request","4,5") would add the string attribute "request":"4,5" to the parent node, while the slight variation parentnode.put("request","[4,5]") would generate the attribute "request":[4,5]. That is fundamentally different, and challenges any assumption the recipient had about types in marshalled data, as the types now vary depending on the contents of the strings being marshalled.
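
A minimal sketch of the kind of test that pins this down, assuming json-lib's net.sf.json.JSONObject behaves as described above; the comments record what the library did, not what anyone wanted.

    import net.sf.json.JSONObject;

    public class SquareBracketSurprise {
      public static void main(String[] args) {
        JSONObject parentnode = new JSONObject();

        parentnode.put("request", "4,5");
        // stays a plain string
        System.out.println(parentnode.get("request").getClass());  // class java.lang.String

        parentnode.put("request", "[4,5]");
        // silently parsed into an array because of the leading "["
        System.out.println(parentnode.get("request").getClass());  // net.sf.json.JSONArray
        System.out.println(parentnode);                            // {"request":[4,5]}
      }
    }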

Needless to say, I was unhappy. At least by the time that point of unhappiness was reached, there were some tests to work out what was going wrong. Which made fixing it possible, the fix being to always single-quote strings being added, such as parentnode.put("request","'[4,5]'"). When the put() operation is invoked, the single quotes are stripped and the unparsed inner value becomes the attribute. With careful wrapping of the put operations -and by ensuring that only one set of double quotes ended up around every string attribute- the tests passed, everything worked, and a new nightly release went out with the issue marked as closed.
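
Sketched out, the workaround amounted to a helper along these lines -putQuoted() is my name for it here, not anything from the codebase:

    import net.sf.json.JSONObject;

    /** Sketch of the single-quote workaround for json-lib's string parsing. */
    public final class JsonLibQuoting {

      private JsonLibQuoting() {
      }

      /** Wrap the value in single quotes so put() stores it verbatim instead of parsing it. */
      public static void putQuoted(JSONObject node, String key, String value) {
        // json-lib strips the single quotes on the way in, leaving the raw string.
        node.put(key, "'" + value + "'");
      }
    }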

Except that the next day, the issue came back. Because it wasn't fixed. Not once you took that node, parentnode, and added it under another node: messagenode.put("payload", parentnode).

Doing that appears to trigger a reparse of every string value. Which means all safely planted array strings end up being reparsed, and converted from strings to JSON arrays.

At this point, the unhappiness level changed from "medium" to "excessive". With the new tests replicating this behaviour, and no obvious in-source switch to say "be less helpful", the only solution that seemed 100% likely to work with all possible payloads and orderings of payload construction was to rip out the entire library and replace it with one that did not exhibit the same behaviour.

Which is what I spent Thursday and Friday doing: a complete removal of all uses of json-lib and its replacement with Jackson. I regret this not just for the time wasted, but because json-lib's Java-friendly object model is nicer than Jackson's "look like the DOM" world view -the DOM, ubiquitous as it is, is pretty painful. Yes, with experience of the XML parsing world, I can certainly use it -but that doesn't mean I enjoy it. I've had to rip out and replace server-side and client-side code, along with other things visible to the rest of the system that used the same JSON library. I haven't fixed all that code -but I've added enough downconvert/upconvert logic that it compiles and appears to work.
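
For contrast, here is a sketch of how the Jackson tree model of the era (org.codehaus.jackson) handles the same calls -not the actual replacement code, just the behaviour: put(String, String) stores a text node whatever characters the value contains, and nesting the node changes nothing.

    import org.codehaus.jackson.map.ObjectMapper;
    import org.codehaus.jackson.node.ObjectNode;

    public class JacksonDoesNotGuess {
      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        ObjectNode parentnode = mapper.createObjectNode();
        parentnode.put("request", "[4,5]");      // stored as the string "[4,5]"

        ObjectNode messagenode = mapper.createObjectNode();
        messagenode.put("payload", parentnode);  // no re-parse of the child's values

        System.out.println(mapper.writeValueAsString(messagenode));
        // {"payload":{"request":"[4,5]"}}
      }
    }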

Four hours over two days writing tests to show the problem existed -then two full days of repair: switching libraries, backtracking on method calls, changing types and seeing what won't compile, running tests. Some extra tests this weekend, then just the merge with two days' worth of other people's changes and some reruns, and it's ready to commit. That will leave Jenkins to do the final retest and email, then everyone downstream will have to fix their side of the down- and up-converted code (the conversion methods are marked as @Deprecated to make them easy to spot), which will take an hour or more on Monday for a couple of people.

When we look back at this, what will we be able to say?

We can now reliably send strings with square brackets round them.

This is one of those weeks that I would never use in a motivational talk for anyone interested in taking up software engineering.

[Artwork: 3Dom, "Remember the Future", Richmond Road, Montpelier]