2013-02-26

Followup: where can we do interesting things at the low levels

I managed to offend Andrew Purtell of Intel in my last post, so to make up for it, here are some of the reasons why Hadoop benefits from his work.


[Photo: Jamaica Street, Feb 2012]

Hardware is changing: CPU architecture, memory layouts, cross-core synchronization, instructions built into the CPU. For efficient code we need everything from minimal branch misprediction to instructions that cut down the cost of core operations -especially decompression, checksums and encryption.

That's why the amount of native code in Hadoop is only going to increase -and that's going to take more than just the skill to write it; it takes the work to identify the problems and bottlenecks in the first place.
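To make that concrete, here's a minimal sketch (mine, not Hadoop code) of the kind of hot loop where the native and hardware-assisted code earns its keep: checksumming every buffer of data read or written. The class name and buffer sizes are invented for illustration; java.util.zip.CRC32 is the stock JDK implementation that native CRC code aims to beat.

    import java.util.zip.CRC32;

    // A plain-Java baseline for the checksum hot loop. Native and
    // hardware-assisted (e.g. SSE4.2 CRC32) implementations exist to
    // beat exactly this sort of code.
    public class ChecksumBaseline {
      public static void main(String[] args) {
        byte[] buffer = new byte[64 * 1024];     // a typical IO buffer
        CRC32 crc = new CRC32();
        long start = System.nanoTime();
        for (int i = 0; i < 10000; i++) {        // ~640MB of data in total
          crc.update(buffer, 0, buffer.length);
        }
        long millis = (System.nanoTime() - start) / 1000000;
        System.out.println("checksum=" + Long.toHexString(crc.getValue())
            + " time=" + millis + "ms");
      }
    }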


Java gets in the way here. You don't know the layout of a Java class in memory, so it's near impossible to come up with cache-efficient designs: pulling a class in on a single cache line, avoiding false sharing within a structure. Even Java's volatile keyword forces in a memory barrier -making it much more expensive than C/C++ volatile. The JVM's memory manager didn't go NUMA-aware until Java 7 -it's been lagging the hardware, though OpenJDK may speed things up there. Except now there's the problem of testing Hadoop on trunk builds of OpenJDK.
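As one illustration (a sketch of the standard workaround, not anything from Hadoop): the usual way to dodge false sharing in Java is manual padding, and because you can't control field layout, the padding is pure guesswork.

    // A guess at cache-line padding: pad the hot field out to (hopefully)
    // its own 64-byte line. The JVM is free to reorder fields, so unlike
    // a C struct there is no way to be sure this layout is what you get.
    class PaddedCounter {
      volatile long value;                 // the frequently-updated field
      long p1, p2, p3, p4, p5, p6, p7;     // ~56 bytes of padding
    }

    class Stats {
      final PaddedCounter reads = new PaddedCounter();   // bumped by thread A
      final PaddedCounter writes = new PaddedCounter();  // bumped by thread B
      // Without the padding both counters could share one cache line, so
      // every increment by one thread invalidates the other's cached copy.
    }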

Looking forward, the end of HDDs is inevitable, but before then, mixed SSD and HDD storage will be the trend -and HDFS has to make use of that, somehow deciding where to put data, and knowing where it is.

If you are scheduling an MR job, and you have fast 10GbE networking, then the job placement problem becomes less "where is there a slot on any of the three servers with the data" and more "which of my three replicas is on SSD" -then doing the work on or near that server, as things will complete that much faster.
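Here's a hypothetical sketch of that placement decision; HDFS doesn't expose storage tiers like this today, so ReplicaLocation and isOnSsd() are names I've made up purely for illustration.

    import java.util.List;

    // Hypothetical tier-aware placement: given the replica locations for a
    // block, prefer one on SSD, otherwise fall back to the first (e.g. most
    // local) replica. Not an HDFS API -just a sketch of the decision.
    class TierAwarePlacement {

      static class ReplicaLocation {
        final String host;
        final boolean onSsd;
        ReplicaLocation(String host, boolean onSsd) {
          this.host = host;
          this.onSsd = onSsd;
        }
        boolean isOnSsd() { return onSsd; }
      }

      // Pick the host to run the task on or near.
      static String chooseHost(List<ReplicaLocation> replicas) {
        for (ReplicaLocation replica : replicas) {
          if (replica.isOnSsd()) {
            return replica.host;       // the fast medium wins over plain locality
          }
        }
        return replicas.get(0).host;   // fall back to the usual placement rules
      }
    }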

And to have some storage of replicas across tiers will need HDFS knowing about the options, and making decisions. Those decisions will be hard, and you'll need something like a rebalancer in each host, moving cold data out of the SSD to leave space for warm data. Unless the SSD is actually used for write-through data, in which case the server you wrote to now has two copies: one on SSD, one on HDD. That may be a viable layout for two of the replicas. With P(SSD fail) << P(HDD fail), you may even consider having more two-copy blocks, though there you would want them on separate boxes.
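And a hypothetical sketch of that per-host rebalancing decision: evict the coldest blocks from the SSD tier when space runs low. CachedBlock, lastAccessTime() and demoteToHdd() are all invented names for illustration, not HDFS APIs.

    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // A per-host SSD tier evictor sketch: when the SSD fills up, demote the
    // least recently read blocks back to HDD until enough space is free.
    class SsdTierEvictor {

      interface CachedBlock {
        long lastAccessTime();     // epoch millis of the last read
        long length();             // bytes held on the SSD
        void demoteToHdd();        // copy back to HDD and free the SSD space
      }

      static void makeRoom(List<CachedBlock> ssdBlocks,
                           long bytesFree, long bytesNeeded) {
        // Coldest first: the blocks nobody has read for the longest time.
        Collections.sort(ssdBlocks, new Comparator<CachedBlock>() {
          public int compare(CachedBlock a, CachedBlock b) {
            return Long.compare(a.lastAccessTime(), b.lastAccessTime());
          }
        });
        for (CachedBlock block : ssdBlocks) {
          if (bytesFree >= bytesNeeded) {
            break;
          }
          block.demoteToHdd();
          bytesFree += block.length();
        }
      }
    }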

Other fun areas are OS integration, the underlying filesystem and the networking stack. If you look at how today's servers and OSes are tuned, it's towards things like databases. In a world of many-server designs, with parallel operations across a cluster, things could be different.

So yes: lots of areas to play in here, all welcome. Especially with tests.

[Photo: Jamaica Street. Centre looks like it is by Paris]
