2013-10-16

Hadoop 2: shipping!

Deamz & Soker: Asterix

The week-long vote is in- Hadoop 2 is now officially released by Apache!

Anyone who wants to use this release should download Hadoop via the Apache Mirrors.

Maven and Ivy users: the version you want to refer to is 2.2.0

The artifacts haven't trickled through to the public repos as of 2013-10-16-11:18 GMT -they are on the ASF staging repo and I've been using them happily all week.

This release marks an epic of development, with YARN being a fundamental rethink of what you can run in a Hadoop cluster: anything you can get to run in a distributed cluster where failures will happen, the Hadoop FileSystem the API for filesystem access  -be it in Java or a native client- and data is measured by the Petabyte.

YARN is going to get a lot of press for the way it transforms what you can do in the cluster, but HDFS itself has changed a lot. This is the first ASF release with active/passive HA -which is why Zookeeper is now on the classpath. CDH 4.x shipped with an earlier version of this- and as we haven't heard of any dramatic data loss events, consider it well tested in the field. Admittedly, if you misconfigure things failover may not happen, but that's something you can qualify with a kill -9 of the active namenode service. Do remember to have to >1 zookeeper instance before you try this -testing ZK failure should also be a qualification process. I think 5 is a number considered safer than 3, though I've heard of one cluster running with 9. Nobody has admitted going up to 11.

This release also adds to HDFS
  1. NFS support: you can mount the FS as an NFS v3 filesystem. This doesn't give you the ability to write to anywhere other than the tail of a file -HFDS is still not-Posix. But then neither is NFS: its caching means that there is a few seconds worth of eventual consistency across nodes (\cite{Distributed Systems, Colouris, Dollimore & Kindberg, p331}).,
  2. Snapshots: you can snapshot some of a filesystem and roll back to it later. Judging by the JIRAs, quotas get quite complex there. What it does mean is that it is harder to lose data by accidental rm -rf operations.
  3. HDFS federation: datanodes can store data for different HDFS namenodes, -Block Storage is now a service- while clients can mount different HDFS filesystems to get access to the data. This is something of primarily of relevance to people working at Yahoo! and facebook scale -everyone else can just get more RAM for their NN and tune the GC options to not lock the server too much]
Hadoop 2 also adds is extensive testing all the way up the stack. In particular, there's a new HBase release coming out soon -hopefully HBase 0.96 will be out in days. Lots of other things have been tested against it -which has helped to identify any incompatibilities between the Hadoop 1.x MapReduce API (MRv1) and Hadoop 2's MRv2, while also getting patches into the the rest of the stack where appropriate. As new releases trickle out, everything will end up being built and qualified on Hadoop 2.

Which is why when you look at the features in Hadoop 2.x, as well as headline items "YARN, HDFS Snapshots, ...", you should also consider the testing and QA that went into this -this is the first stable Hadoop 2 release -the first one extensively tested all the way up the stack. Which is why everyone doing that QA -my colleagues, Matt Foley's QA team, the Bigtop developers, and anyone else working with Hadoop that built and tested their code against Hadoop 2.1 beta and later RCs -and reported bugs.

QA teams: your work is appreciated! Take the rest of the week off!

[photo: Deamz and Soker at St Philips, near Old Market]

No comments:

Post a Comment

Comments are usually moderated -sorry.