Death by Snow

The NY Times has one of the most beautiful HTML5 web articles to date, Snow Fall.

Beyond the shine, the story is about a group of skiers caught in an avalanche in a gully on the back side of the Stevens Pass ski resort in the Cascades -a range famed and treasured for the sheer volume of (heavy) snow that can get dumped on it in a 24 hour period: sometimes a metre a day for several days in a row.

In the story, a group of skiers went down "Tunnel Creek" after fresh snowfall on top of the layer from two weeks previously:
"By morning, there would be 32 inches of fresh snow at Stevens Pass, 21 of them in a 24-hour period of Saturday and Saturday night. That was cause for celebration. It had been more than two weeks since the last decent snowfall. Finally, the tired layer of hard, crusty snow was gone, buried deep under powder."
Given that you know this is a news article where the outcome is not good, you can read that and say "about a metre of fresh snow on a layer whose surface would have frozen together over the previous two weeks", and immediately conclude what's going to happen: the fresh snow isn't going to bond to the old layer, creating a shear plane that's just waiting to trigger.


That doesn't make the rest of the story any better -it's a brutal documentary of what happens when snow does what it often does after a big snowfall: slides down the mountain.

Off-piste skiing isn't skiing, it's winter/spring mountaineering with skis on. Skis that give you speed, but also bias you towards the snowy areas, not the rocky bits. Usually it can be great fun -but it puts you right where avalanches happen.

This article is awful for anyone to read -but if you've been into winter and/or ski mountaineering it's worse: it's a documentary of what's happened to friends of yours, and what could happen to you.

[photo: ski randonnée, Belledonne Range, French Alps, 1994? Skis: Volkl. Camera: Canon. Film and paper: Ilford]


Sorry: I ignore LinkedIn requests from people I don't know

This is an update of my existing policy: I tend to ignore LinkedIn requests from people I don't know.
Stokes Croft Xmas 2012 decorations

If you have been sent a link to this page after you extended an invitation to connect to me on LI, then sorry, it appears you've fallen into this category. This may be because:
  1. I don't know you. As I use LinkedIn primarily as an email address book, adding your email address to it only creates confusion for me later on.
  2. You are an HR recruiting person who hasn't read my critiques of Hadoop recruiting strategies. LI is not the place to find me; trying to connect to me on LI without even paying for a premium account doesn't make you look serious about recruiting -and doesn't benefit the Hadoop ecosystem. And I'm having fun at Hortonworks, so approaching me is a waste of time unless you want your plans made public.
  3. We have met, I have just completely forgotten about it as I am better at remembering email addresses than names or faces. 
If it's option #3: please retry with some better context than the stock "I'd like to add you to my professional network on LinkedIn".

If it's options 1 or 2, LinkedIn is not the way to approach me. I am not trying to build up a vast network -I primarily use it as my address book for people I've worked with on Apache projects, or other people I've worked with. Not as a way of keeping a list of people I don't know.

As I've stated before, LinkedIn actually measures the accept:reject ratio of invitation requests. If I accept invitations from people I don't know, that devalues all my other links and does their graph no benefit at all.


[photo: Xmas 2012 graffiti off Stokes Croft]


Why you should vote for "Hadoop: Embracing Future Hardware"

At some point in the next 10-15 years, the last "rotating iron" hard disk will be made.

That's a profound thought. Admittedly, I may get the date wrong, but the point remains. Just as the CRT, the floppy drive and the CD have gone away, hard disks will become a rarity.

Who cares? Those of us building the future Hadoop platforms do.

Star Wars BBQ

GFS & MapReduce, and Hadoop's HDFS and its MR engine, are all designed to take advantage of "commodity hardware". That means that rather than pay for top-of-the-line Itanium, PowerPC or SPARC servers running a SysV-derived Unix, they use servers built from x86 parts running Linux. This is not because of any ideological support of the x86 architecture: nobody who has ever written x86 assembler or debugged Win32 C++ apps at that level will be fond of the x86. No, x86 parts were chosen because they were the servers with the most cost-effective performance, a manageable power budget (compared to Itanium), and because people actually made servers with them on board.

And why are x86 parts so cost-effective, even though they have so many millions of transistors? Because Intel has managed to plough the revenue from each generation of parts into funding the R&D work and the new fabs needed for the next generation of CPU parts and the processes to manufacture them.

It is the mass consumer and corporate demand for PC desktops that has given us affordable high-performance x86 parts.

Even if the Xeon stuff doesn't work in the desktop, the fabs and the core design are shared -the volumes kept the cost down.

With the emergence of phones and tablets as the new consumer internet access points, sales of PC parts are flatlining, and may decrease in future. Our home PC is used as a store for photographs and as a device for a ten year old to play Minecraft -or to watch YouTube videos of Minecraft. He isn't committed to Intel parts, and as for the photographs, well, 1TB of cloud storage isn't affordable -yet- but that may change. And when your phone can upload directly to Facebook, why faff around downloading things to a local PC?

Even enterprise PCs are changing: they are called "laptops", and SSD storage is moving down from the "ultrabook" class of devices to the mainstream -at a guess, within 3-5 years it'll be SSDs everywhere.

The world of end-user devices is changing -which is going to have implications for servers. We need to look at those trends and start planning ahead, not just to handle the "what happens when HDDs go away" problem, but "how can we make best use of these new parts in 18-24 months?"

Which brings me round to the whole point of this article: my other talk is Hadoop: Embracing Future Hardware.

Vote for it. If not, you'll be taken by surprise when the future happens around you while you weren't looking.

[Photo: something from the Harbourfest, 2008]


Why "Taking Hadoop to the Clouds" is the talk to vote for

The Hadoop summit vote list is up, and I have two proposals -currently undervoted. Even though I'm on the review committee for the futures strand, not even I could push through a talk which had zero votes on it -ideally I'd like my talks to get in through popular acclaim. I could just create 400 fake email addresses and vote-stuff that way, but I'm lazy.

For that reason, I'm going to talk in detail about why my talks will be so excellent that to even think about having them left out could be detrimental to the entire conference.
Page 6 guy interviews

One of my talks is "Taking Hadoop to the Clouds".

There are two competitors
  1. Deploying Hadoop in the Cloud, which looks at options, details and best practices. I don't see anything particularly compelling in the abstract -I assume it's got more votes as it's the one that comes up first. Or they are trying the many-email-address-vote-stuffing technique(*).
  2. How to Deploy Hadoop Applications on Any Cloud & Optimize Price Performance. This could be interesting, as it covers how CliQr deploys Hadoop on different infrastructures. It sounds like a rackable-style orchestration layer above infrastructures; for Hadoop it may have similarities with MastodonC's Kixi work.
Why, then, should people vote for mine?

I'm giving the talk.

This is not me being egocentrically smug about the quality of my presentations, but because I'm reasonably confident I know a lot about the area.
  1. My last time at HP Labs was spent on the implementation of the "Cells" virtual infrastructure: declarative configuration of the entire cluster design. The details were presented at the 5th IEEE/ACM conference on Utility and Cloud Computing, and will no doubt be in the ACM library. This means I know about IaaS implementation details; the problems of placement, why networking behaves the way it does, image management, what UIs could look like, what the APIs could be, etc.
  2. I've spent a lot of time publicly making Hadoop cloud-friendly. I presume that MS Azure and AWS ElasticMR have put in more hours, but unless they're going to talk about their work, Tom White and myself are the next choices. Jun Ping and his VMware colleagues have done a lot too -big patches into the codebase- but I don't see any submissions from them.
  3. I have opinions on the matter. They aren't clear cut "cloud good/physical bad" or "physical bad/cloud good". There are arguments either way; it depends on what you want to do, what your data volume is, and where it lives.
  4. I'm still working in the area, in Hadoop itself and the code nearby.
Recent cloud-related activities include
  • HADOOP-8545: a Swift filesystem driver for OpenStack. This is something everyone running Hadoop on Rackspace or other OpenStack clusters will want. This week two different implementations have surfaced; getting them merged together is going to be the next activity.
  • WHIRR-667: Add whirr support for HDP-1 installation
  • Ambari with Whirr. Proof of concept more than anything else.
  • jclouds and Rackspace UK throttling. Adrian Cole managed to reduce the impact of issue-549, which is good, as I don't really want to get sucked into a different OSS codebase.
  • Other things that I'm not going to talk about -yet. 
That's why people should vote for me. The other talks will be about "how we got Hadoop to work in a virtual world" -mine will be about how we improved Hadoop to work in a virtual world.

(*) PS: for anyone planning the many-email-accounts approach, remember that the email addresses are something we reviewers can look at, and many sequential accounts all giving three votes to a single talk will show up as "statistically significant". Russ has the data; he likes his analyses. He may even have the IP addresses.

[Photo: an interview with Page 6 Guy at ApacheCon]


An Intro to Contributing to Hadoop

Together the ants shall conquer the elephant

Jeff Bean of Cloudera has stuck up a video on contributing to Hadoop, which is a reasonable introduction to JIRA-centric development.

Process-wise, there's a few things I'd add:
  • Search for the issue or feature before you file a new bug. The first line of a stack trace is a great search term, though it's a bit depressing to find that the only other person to hit it was yourself, 18 months earlier -and you never fixed it then either.
  • It's harder to get committer rights on Hadoop than on most other projects, because the bar -in effort and competence- is high. You pretty much have to work full time on the project. Posting four JIRAs and then asking for committer access is unrealistic, and it doesn't bring much to the table except bragging rights.
  • The bit at 16:20 where Jeff said "email other contributors to get eyes" was in fact an error. He meant to say "email wittenauer to get constructive feedback on your ideas" -nobody else welcomes such emails, and actually talking on the -dev list is better.
  • I'd also emphasise the "watch issue" button. If there is something you care about, hit the watch button to get emails whenever it is updated.
  • When you file a bug, include stack traces, kill -QUIT thread dumps, netstat and lsof details for the process in question; anything else that helps. NOT: JPG screenshots of your DOS console. Those flag up that you are probably out of your depth when it comes to getting JAVA_HOME set, let alone discussing the impact of VM clock drift on consensus-protocol-based distributed journalling systems.
  • When you file your bug, your rating -critical, major, etc.- differs from everyone else's. Mine are normally minor or trivial. If it only affects you: minor. Easy to fix: trivial.
  • Don't file bugs about "I couldn't get Hadoop to install". Those bugs will be closed as invalid; posts on it to the -dev lists silently ignored. Go to the user lists. 
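The diagnostics bullet above can be made concrete. A sketch, with a throwaway `sleep` process standing in for the Hadoop daemon (the PID and filenames are illustrative; on a real JVM the `kill -QUIT` thread dump lands in the daemon's stdout log, so capture that too):

```shell
# Collect text diagnostics to attach to the JIRA -not screenshots.
# A background sleep stands in for the JVM under investigation.
sleep 60 &
PID=$!

# open sockets, and open files/handles of the process
{ netstat -an 2>/dev/null || ss -an; } > netstat.txt
{ lsof -p "$PID" 2>/dev/null || ls -l /proc/"$PID"/fd; } > lsof.txt 2>/dev/null

# SIGQUIT asks a JVM to print a full thread dump to its stdout log
kill -QUIT "$PID"

echo "attach netstat.txt and lsof.txt to the issue"
```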

I was a bit disappointed by the claim that "the apache artifacts aren't stable, you need CDH" and the message that there is "the community" and "cloudera engineers", the latter being the only people who make Hadoop enterprise-ready. As well as Hortonworks, there are companies like IBM, Microsoft and VMWare working on making sure their customers' needs are met -and testing the Apache releases to make sure they're up to a state where you can use them in production.(*)

This "we are the engineers" story falls over at 07:00 when, in the walkthrough of the (epic) HA NameNode work, my colleagues Sanjay, Suresh and Jitendra all get a mention. Because Hadoop is a community project -one that involves multiple companies working together on Hadoop- as well as individuals and small teams. The strength of the Hadoop codebase comes from the combined contributions of everyone. Furthermore, having a no-single-vendor open source project, with public artifacts you can pick up and use, adds a strategic advantage to that codebase. Hadoop is not MySQL or OpenJDK -open source with secret bits that the single vendor can charge for. There's a cost to that -more need to develop a consensus- which is why I encourage people using Hadoop in production systems to get on the -dev lists, regardless of how Hadoop gets to your servers. Participation in those discussions gives you a direct say in the future direction of the project.

Overall though, not a bad intro to how to get started in the development. It makes me think I should do a video of my intro to hadoop-dev slides, which looks less at JIRA and more at why the development process is as it is, and how we could improve it. Someone else can do the "why Maven is considered a good tool for releasing Hadoop" talk -all I know is that I have to do a "mvn install -DskipTests" every morning to stop Maven trying to go to the Apache snapshot repo to download other people's artifacts, instead of the ones I built the day before.

(*) Yes, I know that Hadoop 1.1.1 is being replaced with a 1.1.2 to backport a fix for a show-stopper deadlock, but that's a very rare case -and it shows that we do react to any problem in the stable branch that is considered serious.

[Photo, "together the ants shall conquer the elephant", alongside the M32 in Easton].


AWS: why the bias towards US-east?

As MastodonC will point out, Amazon's US-East sites are the most polluting, not just because they have a high CO2 footprint, but because the coal that they (and the other east coast industries) burn pollutes in other ways, such as sulphur. It's not as bad as, say, a steelworks (having had relatives living near Ravenscraig Steelworks, I can vouch for this), but as datacentres can be placed near other electricity sources, it's needless.

I intermittently use US-West-2, up in Oregon, where the melting snow creates electricity.

Crater Lake Tour 2012

Unfortunately there's an implicit bias in the AWS APIs towards US-East. Where's the default site for S3 buckets? US-East. Where's the default site for EC2 instances? US-East. What is the default location for EMR jobs? The same -to the extent that the command line client treats requesting a different site as "uncommon":

Uncommon Options
 --debug               Print stack traces when exceptions occur
 --endpoint ENDPOINT   EMR web service host to connect to
 --region REGION       The region to use for the endpoint
 --apps-path APPS_PATH Specify s3:// path to the base of the emr public bucket to use. e.g s3://us-east-1.elasticmapreduce

Because of all the implicit "us-east" bias, it becomes self-reinforcing. Once you've got a bucket on S3 East, that's where you want to run your webapps, otherwise you get billed for the remote bandwidth. Once you've got the webapps, that's where your logs go -hence even more reason to run your MR jobs on the same site: it's where your data lives.

Because it's the default location for stuff, it's also the default location for people serving up data on the site: RPM and Maven repositories, public datasets. This pushes you towards that location so as to avoid the costs of downloading that data from other sites, as well as the speed gain.
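The pull of that default is visible in the endpoint names themselves. A hypothetical helper, purely to illustrate the point (the region-qualified hostname pattern matches Amazon's published S3 endpoints; the function itself is mine):

```shell
# Resolve the S3 endpoint for a region; omit the argument and, like the
# AWS tooling, you silently fall back to US-East.
s3_endpoint() {
  region="${1:-us-east-1}"
  if [ "$region" = "us-east-1" ]; then
    echo "s3.amazonaws.com"             # the implicit default
  else
    echo "s3-${region}.amazonaws.com"
  fi
}

s3_endpoint              # -> s3.amazonaws.com
s3_endpoint us-west-2    # -> s3-us-west-2.amazonaws.com
```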

Why the bias? Either it's where the majority of servers lie, or, through a combination of cost of electricity, site PUE and bandwidth, it's got the lowest operating costs -hence the most profit per CPU-hour, MB stored or MB downloaded.

That's a shame, because Amazon themselves have better options. They're being crucified by Parliament over their tax avoidance strategies -it'd be tactically wise to have something positive to talk about.

[Photo: Crater Lake & Mt Thielsen. Smoke is a forest fire blowing up from CA]


Crater Lake: T+11

Following on from my "Page Mill T+20" trip, in late August we ended up at Crater Lake for the Corvallis "Mid Valley Bicycle Club" annual circuit of the lake.


Crater Lake

And in 2012:
Crater Lake Tour 2012
The colours are different as in 2012 the Lassen fires 80 miles to the south are adding a light smoke to the air.

The original picture was taken with a Sony camera at 2048*1536: 3 megapixels. That resolution is lower than my desktop monitor's, which makes it appear grainy as a background.

In 2012, the original size of 4000*3000 means four times as many dots; the Panasonic compact has a Leica lens and makes up for the loss of a viewfinder with the ability to display a grid + diagonals over the image, to increase P(horizontal(horizon)).

In August 2001, Bina was 5 months pregnant; now our son is 10 and did the loop on a tandem, working with Mike Wilson, who races in the PNW CX circuit in the category below his age to make it more challenging. That may seem to have given Alexander some help -but it also meant that he was made to do it at a fairly aggressive pace, with none of this resting business.

Crater Lake Tour 2012
I did it on a borrowed MTB, with knobbly tires, and took a couple of detours to add 12+ miles to my route. Even so, compared to Alexander, I look suspiciously tired.

Crater Lake Tour 2012

I got back to the campground (because of those detours, honest) about an hour after him, tired, needing my rest and refreshments.

What is the ten year old doing? Running around chasing chipmunks. Then he comes over and tries to steal my beer.

Crater Lake Tour 2012

That's it then -isn't it? I may as well retire now.

Were it not for the fact that university education is becoming so expensive that my son will need a large amount of cash to get through it, I'd have no further contribution to make towards my DNA's survival.


And now: the People's Republic of Bristol

There was an election across England and Wales yesterday. Mostly it was for a new position, Police Commissioner, which was so uninspiring that one polling station in Newport, Wales, had a turnout of exactly zero.
Stokes Croft

In Bristol, we had something else: a mayoral election -one decided by a first-choice/second-choice voting system. The candidates: the three main parties, some of the "troublemaker" parties -the Greens, Respect- and some independents, including one who lives in a van near Stokes Croft.

The results are in, and today we have something new in the city: an Independent Mayor.

I've met George Ferguson a couple of times -he's done a lot for the city and, as they say, could "organise a piss-up in a brewery" -as he owns one of the local breweries.

This could show a profound change: the locals would rather have someone in charge who wasn't beholden to a party line coming from London, and who stated clearly that he'd be appointing his cabinet (from the existing councillors?) on merit, not just from the subset belonging to a single party.

There are some other factors at play: a large proportion of voters in Liberal Democrat strongholds appear to have gone for George Ferguson -and those areas had the highest turnout. My own ward was at 20% turnout -and when I dropped round to deliver our two postal vote envelopes, the staff were pleasantly surprised. As an attempt to raise awareness of and interest in elections, it's failed.

It'll be interesting to see how having an independent works out. Patronage has always been one of the ways a political party achieves loyalty, and I wonder how many people in the council will be working for him, rather than against him.

In the meantime -I shall head down to the Canteen, Stokes Croft, and have one of his beers there.


A Hadoop Standards Body? It's called the Apache Software Foundation

I am writing this on the ICE502 train from Mannheim to Frankfurt. To my left, my friend Paolo Castagna pages through the emails from Cloudera HQ that are slowly trickling into his phone; I'm out of network range, so can't go over to the small-kids (Kleinkinder) compartment and Skype in to a Hortonworks team meeting.

We are on our way back from ApacheCon EU.
Zooming in

Over the last week, the topics of the talks I've attended have included (omitting my own): Cassandra development, RDF processing in Apache Hadoop (ask Paolo about that), logging futures, post-Apache-Maven build tools, Apache OpenOffice cloud integration, CloudStack, the Apache HBase status quo -Lars showed how all the HDFS work we've been doing is really going to benefit Apache HBase- NoSQL ORM, Apache Mahout, and many others. A large proportion of the Apache Hadoop datacentre stack is there -and we can sit down and discuss issues. It may be an internal issue: how to move away from commons-logging; it may be something cross-project, such as how HDFS could let HBase explicitly request a block placement policy for each region server that kept all replicas on the same rack; or it could be something indirectly relevant, like Apache OpenOffice slideshow improvements.

We've been treated to slides from Steve Watt of HP showing their prototype ARM64 server systems, which will offer tens of servers in a 2U unit -a profound achievement. We've been treated to some excellent beer at the Adobe reception, which went from 18:00 until we were evicted at 21:00.

I met lots of people, some I knew, some I'd never met face to face before, some who were complete strangers until this week. We've been in the same talks, eaten at the same tables, drunk beer in the two restaurants and the cafe in this town, discussing everything from OSGi classloading in Apache Karaf to jumbo Ethernet frames, and what to do when the remains of a decomposing whale end up in your datacentre. The people I shared those cafes with included Lars George (Cloudera), Steve Watt (HP), Isabel Drost (Nokia), and three people who had a whale-related incident in their facility.
A whale? a whale?

Not once did anyone say: "Let's give some standards body the Apache Hadoop trademark and the right to define our APIs as well as the exact semantics of the implementation!"

Nobody said that. Not even whispered it.

Because from the open source perspective, it makes no sense whatsoever. The subject that did come up was "Jackson versioning grief" -which relates to an open JIRA.

I gave a talk saying there is lots of work to be done, pointing people at svn.apache.org and issues.apache.org, saying "get involved" -and discussing how to do so.

Key things to do
  • gain trust by getting on the lists and being visible (and competent, obviously)
  • help review other people's patches, not just your own
  • don't try and do big things in Apache HDFS (risk of data loss) or Apache MapReduce (performance and scale risks).
What I did emphasise is that we do want more people helping -and that we need to improve how this is done. I did not suggest that we could do this "under an industry forum -either an established group or one that is specifically focused on big data".

What I suggested was -and these are entirely personal opinions -
  1. some mechanism for mentoring external development projects, so that they don't fail, get neglected, or appear without any warning -creating integration problems.
  2. better distributed development, so that those of us outside the Bay Area can be involved in the development: Google+ events, more pure-online meetings in various timezones. The YARN event that Arun organised is something I want to praise here: we remote attendees got WebEx audio and the slides shared remotely. Even so, it was very late in the EU evening, and there's always an imbalance between the people in the room -the visible, vocal audience- and the people down the speakerphone.
  3. better patch integration through Git and Gerrit. Even if svn is the normative repo, we should be able to accept patches as pull requests that go through Gerrit review; people can update their patches trivially through merging trunk with their branch and pushing out their branch to a public repo.
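Point 3's update-a-patch flow can be sketched end to end, with throwaway local repositories standing in for the Apache mirror and a contributor's clone (all repo, branch and file names here are illustrative):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# "upstream" plays the Apache git mirror, with a trunk branch
git init -q upstream
(cd upstream \
 && git config user.email you@example.com && git config user.name you \
 && echo base > core.txt && git add core.txt && git commit -qm "initial" \
 && git branch -m trunk)

# a contributor clones it and develops on a feature branch
git clone -q upstream fork
cd fork
git config user.email you@example.com
git config user.name you
git checkout -qb HADOOP-0000-feature
echo fix > fix.txt && git add fix.txt && git commit -qm "HADOOP-0000: fix"

# meanwhile, trunk moves on upstream...
(cd ../upstream && echo more >> core.txt && git commit -qam "trunk advances")

# ...so the contributor refreshes the patch by merging the new trunk in,
# then re-publishes the branch for review
git fetch -q origin
git merge -q --no-edit origin/trunk
echo "feature branch refreshed against trunk"
```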
I also mentioned tests. Not just tests of new features -where we are obsessive about "no features without tests", but in improving the coverage of the system, and formalising the semantics of the system.

If there is ambiguity in the behaviour of bits of Apache Hadoop, tests added to the Apache source repository, svn.apache.org, define that behaviour. Regression testing the entire stack finds problems, which is why we love to do that -especially things like testing that repeated runs of Apache HBase's functional test suites succeed while our test infrastructure triggers NameNode failover, or how deploying Yahoo!'s existing applications on the new MRv2 engine in YARN improves the performance of those applications -while finding any regressions in MRv2 from the MRv1 runtime.

Testing against Apache Hadoop is the way to guarantee compatibility with Apache Hadoop -because the Apache Hadoop code is Hadoop.

At the root of the svn.apache.org/hadoop source tree, in the Apache tarballs and RPMs, and in those products that include the ASF artifacts or forks thereof is a file: LICENSE.TXT
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
What does that mean? It means:

Anyone is free to write whatever distributed filesystem they want, and implement whatever distributed computing platforms on top of that they choose -but they cannot call it Hadoop.

There's a nice simple metric here:

If you can't file bug reports against something in issues.apache.org, it's not an Apache product -and hence not Apache Hadoop.

For that reason: I'm not convinced that the Hadoop stack needs to care about the compatibility concerns of people trying to produce alternative platforms, any more than Microsoft needs to care about the work in Linux to run Windows device drivers.


A WS-I standards body for Hadoop? -1 to that

see no evil at sunset

There's an article from IBM which argues that Hadoop needs to copy WS-I and OASIS to have a set of standards:
In the early 2000s, the service-oriented architecture world didn’t begin to mature until industry groups such as OASIS and WS-I stabilized a core group of specs such as WSDL, SOAP, and the like.
I despair. Anyone who credits OASIS and WS-I for this does not know their history -or is trying to rewrite it.

The initial interop came from the soapbuilders mailing list, in which Sam Ruby (IBM, ASF), Davanum Srinivas (CA -> WSO2 -> IBM, and "dims" at Apache) and Glen Daniels (at the time, Macromedia + Apache) all played key parts. Anyone in IBM curious about soapbuilders should ask Sam about it.

Soapbuilders was the engineers: finding problems, fixing them. Agile, fast, focused on problems, fast turnaround in the evolving SOAP stacks.

It died. Killed by WS-I.

WS-I wasn't the engineers, it was the standards people. Suddenly the battle lines were drawn over whose idea was going to be standardised. Take the problem of shipping binary data around. Base-64? Works, but inefficiently. Microsoft's DIME? SOAP with Attachments? MTOM? WS-I was where the battles were fought, sometimes won, sometimes solved with "all of them".

I have been known to express opinions on the cause of interop problems in SOAP; I'm not going to revisit that, except to note that the focus of SOAP interop settled on Java<->.NET interop, which was addressed not by standards bodies but by plugfests, the standards themselves being too vague to cover the hard issues -especially defining which proper subsets of WS-* different stacks would support, and which proper subsets of WSDL could be generated and parsed. Ideally the parseable subset was a superset of the generatable subset, but, well, there could always be surprises. Those discussions would end up on soapbuilders, where maybe the developers could fix them without the standards team getting in the way.

I'm going to pick on one specification as the worst of all: WS-A.
  1. It replaced a simple concept, the URL, with an arbitrary lump of XML.
  2. It went through an iterative development process that resulted in multiple versions.
  3. It used XML namespaces to identify those versions.
  4. It used three namespaces -2003, 2004 and 2005/08- to identify four different versions; the 2005/08 one has both an "interim" and a "final" release.
That's what happens when you put a standards body like WS-I in charge of something as simple as a URL. Not only do they make a vast mess of it, you get that vast mess in multiple XML namespaces.

As for OASIS? WS-RF. That's all I need to say to anyone involved in WS-*, or any of the grid proposals. A standard for managing "resources" -leased things at the end of WS-A addresses- across a network. A standard that managed to include two different WS-A versions in it.

Think about that for a minute. You have a "standard" produced by a standards body that is somehow affiliated with the UN and has some official status, yet they cannot push out one of the WS-* specification suites without incorporating different concepts of "how to address a SOAP endpoint" within that single 'coherent' suite of documents -ending up including non-normative drafts as well as the final versions.

I am not going to make any statements on Hadoop standardisation in this context as it will probably be taken for an official stance rather than my personal opinions.  There is a section on Hadoop Compatibility that I wrote in Defining Hadoop; it sounds like some people and organisations ought to read that article.

I do want to close with the following point:
WS-I and OASIS were not bodies capable of producing a "standard", where "standard" could be defined as a coherent, consistent and testable set of protocols. Instead they were places where vendors could push their own agendas, where the winners were the organisations capable of funding the most participants in the standards process, or those willing to do the most back-room deals with others.

The compromises needed to get anything out the door in even a hopelessly untimely manner produced an incoherent mess of XML namespaces, schemas and protocol issues that anyone working on SOAP stacks still has to deal with today.
REST did not win just because it was architecturally cleaner, or because it was more powerful. It won because the alternative was the set of WS-* specifications that came out of WS-I and OASIS. Those organisations did not set WS-* on its route to global success; they condemned it to the niche of intra-enterprise Java/.NET communications, a decade after CORBA could have done the same thing better.

[photo: sunset on Nelson Street from St Michael's Hill]


Hadoop in Practice - "Applied Hadoop"

Recent train journeys to and from London have given me a chance to get the laptop out and read some of the collected PDFs of things I know I should read.

St Pauls Graffiti

I was given a PDF copy of Hadoop in Practice [Holmes, 2012] on account of the fact that I'd intermittently been in the preview program -but I'd not looked at it in any detail until now. The (unexpectedly) slow train journeys to and from London have been an opportunity to unfold the laptop and read it -and, at home, while I wait for EC2 to respond to whirr requests, to read it to the end -though not in as much detail as it deserves.

The key premises of this book are
  1.  You've read one of the general purpose "this is Hadoop" books -either the Definitive Guide or Hadoop in Action.
  2.  You want to do more with Hadoop.
  3.  You aren't concerned with managing the cluster.
  4.  You are concerned about how to integrate a Hadoop cluster with the rest of your organisation.
#3 means that there's nothing here on metrics, logging or low-level things. This is a book for developers and (yes) architects; less for the operations people. Even so, the sections on integration with other systems -especially hooking up to log sources and databases- are ones those people need to know about.
Although it starts off with a quick overview of Hadoop and MapReduce, internals -such as how HDFS works- are relegated to appendices for the curious. Instead, the first detailed chapter looks at Ingress and Egress, or, so as not to scare readers, "Moving Data in and out of Hadoop", looking mostly at Flume, mentioning Chukwa and Scribe, and then moving on to using Oozie-scheduled MR jobs to pull data -something covered in an example in the book.

It doesn't delve into the aspects of this problem you'd need to worry about in production -data rates, the risk that MR pull jobs can either overload the endpoints or, unless they are split up well, can create imbalanced filesystems. Ops problems -or just too much to worry about right now.  What it does do is show why a workflow engine like Oozie is useful: to automate the regular work.

It glues the Hadoop ecosystem together. Want to parse XML? Grab the XML input reader from Mahout. Want to work with JSON? Twitter's Elephant Bird… etc. In fact the serialization chapter went into the depths of XML and JSON parsing -and showed the problems, so justifying the next stage: Protobuf, Avro and Thrift.

There's a chapter on tuning problems which focuses more on code-level issues than hardware; this is where the line between ops & developers gets blurred. I think I'd have approached the problem in a different order, but the tactics are all valid.

Installation-wise, Alex points everyone at a version of CDH without LZO support; he has to talk people through building it. I don't know where Cloudera stand on that, as I know yum -y install hadoop-lzo works for HDP, and is up there with hadoop-native, hadoop-pipes, hadoop-libhdfs and snappy as RPMs to add (update: see below). I'd have liked to have seen bigtop as the centre of the universe, so as to be more neutral -something to hope for in the second edition.

There's a few chapters on "data science" stuff: bloom filters, simple graph operations, R & Hadoop integration. I get the feeling that this section is very handy if you know your statistics and want to work with a new toolset. The problem I have there is a personal one: I've forgotten too much of what I knew about statistics. Min, max, mean, Poisson, Gaussian and Weibull distributions; the notion of Markov chains -these are all concepts I know about, but ask me the equation behind a Poisson distribution and I stare as blankly at the questioner as our pet rabbit does when asked why he's been chewing power cables: there's no comprehension going on behind the eyeballs. I really need something that covers "statistics for people who used to know it vaguely -using R & Pig as the tools". There's a good argument for all developers to know more stats. This book isn't that -it does assume you know your statistics, at least better than I do.

Alex Holmes delves into MRUnit, which is a good way of unit testing individual operations. I tend to do something else: MiniMRCluster -but that one, while more authentic, can push problems onto different threads and so make it harder to identify root causes of problems -or to isolate tests. MRUnit doesn't have that flaw, and nor does LocalJobRunner -which also gets coverage. The only thing that grated against me there was that the tests were done in Java -I've been using Groovy as my test language for the whole of 2012, and the sheer verbosity of setting up lists in Java, and the crudeness of JUnit's assertions compared to Groovy's assert statements, are painful to look at.

For anyone who's never used Groovy, its assert statement takes advantage of the compile-on-demand features of the language. On an assertion failure, the output walks through the entire expression tree, evaluates every part in turn and gives you the complete tree for your debugging pleasure. You can write one all-encompassing assertion, rather than break down each part of a large query into various assertNotNull, assertTrue, assertEquals calls -and if the single assert fails, there should be enough information for you to track down the cause.  That's why I like testing in Groovy, irrespective of whether or not your production code is in Java.

Other points: the ebook comes with your email address at the bottom, but no epub-esque security. This works on your Linux workstation as well as whatever tablet you choose to own -and relies on publicity & guilt to stop sharing. Which is probably a good strategy. That eBook comes with a feature I've never seen before: the page numbers in the contents match exactly the page numbers in the book -there must be some Framemaker magic that tells Preview &c the offset to apply after the user hits the "go to page" button.

Summary: this isn't a book for newbies -precisely because it delves into Applied Hadoop. Even so, it's something you ought to have to hand, just so you aren't one of the people posting questions to user@hadoop that everyone else stares at and generally refuses to answer -the "hello, I have got a pseudo-distributed cluster that cannot find localhost, here is the screenshot of the DOS console, please help!!!" messages, whose authors forget to even include the screenshot of their hadoop.bat command line failing because they've forgotten to do something foundational like install Java.

Everyone but @castagna will learn something new -in fact maybe even him, because he needs something to read on test runs and trains to London (which is where I'm writing this, somewhere between Reading and London Paddington).

Update: Eric Sammer says of the LZO thing "hadoop-lzo in cdh, it's because of license concerns that we don't distrib."


Rethinking JVM & System configuration languages


I've been busy in Apache Whirr, with a complete service that installs HDP-1 on a set of cluster nodes -WHIRR-667; the source is all up on Github for people to play with. As a result, someone asked me why I'm not using SmartFrog to provision Hadoop clusters.
Having used it as a tool for a number of years, I'm aware of its flaws:

Specification language
  • Hard to track down where something gets defined
  • x-reference syntax a PITA to use and debug
  • Fuzzy distinction about LAZY eval vs. pre-deploy evaluation (LAZY is interpreted at deployment, but 'when' is ambiguous)
  •  RMI is the wrong approach: brittle, often undertested in real-world situations, & doesn't handle service restarts, as references break.
  •  Wire-format serialized Java objects; the Object->Text->Parse->Object serialization proved surprisingly problematic (not defining the text encoding didn't help)
  •  Security so fiddly that we would often turn it off.
  •  Doesn't work unless Java is installed and the network is up -so not so good for basic machine setup from inside the machine itself, only outside-in (which is partly what Whirr does).
  •  Java doesn't let you get at many of the OS-specific details (permissions, process specifics); you end up hacking execs to do this.
  •  The way you imported other templates (#import keyword) was C-era -multiple imports would take place, the order in which they were loaded mattered.
  •  Shows its age -doesn't use dependency injection and becomes hard to work with (NB: whirr doesn't inject either)
In defence:
  •    it's not WS-*
  •    language better than XML (especially spring XML)
  •    good for writing distributed tests in
  •    Most XML languages insert variable/x-ref syntaxes in different ways (ant, maven, XSD, ...); SF has a formal reference syntax that doesn't change.
  • Being able to x-ref to dynamic data as well as static is powerful, albeit dangerous as the values can vary depending on where you resolve the values, as well as changing per run. And they stop you doing more static analysis of the specification.
  • Being able to refer to string & int constants in Java source is convenient too (classpath issues notwithstanding). For example, I could say:
serviceName: CONSTANT org.smartfrog.package.Myclass.SERVICE;

    The constant would then be grabbed from source. This may seem minor, but consider how often string constants are replicated in configuration files as well as source -and how a typo on either side creates obscure bugs. Eliminating that duplication reduces problems.
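The same idea can be sketched outside Java; here is a minimal Python analogue (a hypothetical `resolve_constant` helper, not part of SmartFrog) that resolves a dotted path to the value defined in source:

```python
import importlib

def resolve_constant(dotted_path):
    """Resolve 'package.module.NAME' to the value defined in source,
    so configuration files never duplicate the literal."""
    module_path, _, name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)

# The config layer can then reference os.sep, math.pi, or any project
# constant by name rather than copying its value around.
```

Any rename or value change then propagates automatically, and a typo fails fast at resolution time instead of producing an obscure runtime mismatch.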
Looking at Whirr I can see how the two-level property file config design has limits (all extended services need to have their handlers declared in every config that uses them); templates of some form or other would correct this.

Ignoring the specific issue of VM setup (I need to write a long blog there criticising the entire concept of VM configuration as it is today, as it's like linking a C++ app by hand), I'd do things differently.
I think we need a post-properties, post-SF language: a strict superset of JSON to which it could be compiled down, with property expansion in x-refs, the ability to declare which attributes to inject/are mandatory, and some Prolog & Erlang-style list syntax to make list play easier. No dynamic values, because that prevents evaluation in advance.

"org.apache.whirr.hdp.Hdp1": org.apache.whirr.hadoop.Hadoop {
  "port": 50070,
  "logdir": "/var/log/${user}",
  //Extend the list of things to inject
  "org.smartfrog.inject": ["logdir" |super:"org.smartfrog.inject"]

The template being extended would be this:
"org.apache.whirr.hadoop.Hadoop": {
  "timeout": 60000,
  "port": 50070,
  "description": "hadoop",
  "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
  "org.smartfrog.inject": ["timeout", "port","install" "configure","user"],
  "org.smartfrog.require": [install", "configure"]

This would compile down to an expanded piece of JSON; as it expands out fully, you could use the result as plain JSON anywhere.
"org.apache.whirr.hdp.Hdp1":  {
  "timeout": 60000,
  "port": 50070,
  "description": "hadoop",
  "logdir": "/var/log/mapred",
  "org.smartfrog.inject": ["logdir" ,"timeout", "port","install" "configure","user"],
  "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
  "org.smartfrog.require": [install", "configure"]

  1. Importing is a troublespot -if you required fully qualified template references that mapped to specific package & file names, then you could just have a directory path tree (a la Python), possibly with zip file/JAR file bundling, and have the templates located there.
  2. I'm avoiding worrying about references; you'd need a syntax outside of strings to do this. It'd be a lot simpler than the SF one -fully qualified refs again, up/down the current tree, and to the super-template.
  3. No runtime references.
This syntax would be parseable in multiple languages; the expanded pure JSON would be the serialization format.
A Java interpreter could take that and execute it, doing attribute injection where requested, failing if a required value is missing. Behind the scenes you'd have things that do stuff. I'd also look very closely at whether to use Java at all -not least because I'm enjoying living in a half-post-Java world (Groovy for tests, GUIs &c).
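The expansion step could be sketched like this -a minimal, hypothetical model in Python where templates are plain dicts, child values override the parent, and a `"super:<key>"` marker splices in the parent's list (standing in for the `[x | super:ref]` syntax above):

```python
def expand(parent, child):
    """Expand a child template against its parent: inherit, override,
    and splice parent lists where a "super:<key>" marker appears."""
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, list):
            spliced = []
            for item in value:
                if isinstance(item, str) and item.startswith("super:"):
                    spliced.extend(parent.get(item[len("super:"):], []))
                else:
                    spliced.append(item)
            merged[key] = spliced
        else:
            merged[key] = value
    return merged

def check_required(template):
    """Fail, as the interpreter would, if a required attribute is absent."""
    for key in template.get("org.smartfrog.require", []):
        if key not in template:
            raise KeyError("missing required attribute: " + key)
```

With a parent of `{"timeout": 60000, "port": 50070, "org.smartfrog.inject": ["timeout", "port"]}` and a child adding `"logdir"` plus the splice marker, the expansion produces the same flattened form shown above.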

One other possibility here is that, given it's JSON, embrace JavaScript more fully. What if you had not only the configuration params but also the option of adding JS code in there too? You could have some fun there.

A cluster would be defined from this, here using the same role-name concept that whirr uses, with something like:
"1 hadoop-namenode+hadoop-jobtracker, 512 hadoop-tasktracker+hadoop-datanode"

In a JSON template language you'd split things up more & use lists. It's more verbose, yet tunable.
Your cluster templates would extend the basic ones, so a cluster targeting EC2 could extend "org.apache.whirr.hdp.Hdp1" and add the EC2 options of AMI location and AWS region (West Coast 2, obviously), as well as authentication details -or leave that to the end. (There are some thoughts on mixins arising here; let's not go there, but I can see the value.)

stevecluster: ClusterSpec org.apache.whirr.hdp.Hdp1 {
  "templates": {
    "manager": {
      "Services": ["hadoop-namenode", "hadoop-jobtracker"],
      "Count": "1"
    },
    "worker": {
      "Services": ["hadoop-tasktracker", "hadoop-datanode"],
      "Count": "255"
    }
  }
}
A template without the login facts would need to be given the final properties on startup, props that could be injected as system properties (launch-cluster --conf stevecluster.jsx -Dstevecluster.ec2-ami=us-west2/ami5454). Properties set this way would automatically override anything set. That is, unless there is (somehow) support for a final attribute, which Hadoop uses to stop end users overwriting some of the admin-set config values with their own. Without going into per-key attributes, you could have a special key, final, which took a list of which of the peer attributes were final. Actually, thinking about it more, @final would be better. Which would be hard to turn into JSON…
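That @final idea is easy to sketch; here's a hypothetical Python helper (the "@final" key name is made up here -only the Hadoop final-attribute behaviour it mimics is real):

```python
def apply_overrides(config, overrides):
    """Apply -D style overrides to a config, skipping any key listed
    in the template's "@final" list of peer attributes."""
    final_keys = set(config.get("@final", []))
    result = dict(config)
    for key, value in overrides.items():
        if key not in final_keys:
            result[key] = value
    return result

# e.g. a user can retarget the port, but not the admin-set user account:
cfg = {"port": 50070, "user": "mapred", "@final": ["user"]}
cfg = apply_overrides(cfg, {"port": 8020, "user": "root"})
```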

I could imagine using the same template language to generate compatible properties files today; this JSON-template stuff would just be a preprocess operation to generate a .properties file. That's making me think of XSLT, which is even scarier than mixins.

I have no plans to do anything like this.

I just think a template-extension to JSON would be very handy, and that some aspects of the SmartFrog template language are very powerful & convenient, irrespective of how they are used.
If someone were to do this, the obvious place in Apache-land would be in commons-configuration, as then everything which read its config that way would get the resolved config. That framework is built around hierarchical property files -think log4j.properties- so resolves everything to a string and then converts to numbers afterwards. Lists and subtrees are likely to be trouble there -albeit fantastic if they work.


After a week of OS/X mail, I'm (almost) pining for Outlook

Jamaica Street

Because networking from the hotel room last week was limited to a tethered 3G phone, I switched to a local email program for my messages, saving bandwidth and allowing offline use. That email program was Apple Mail for Mountain Lion. I then decided to follow through by using it for a whole seven days. Never again.

First, the UI isn't that great. The most glaring problem is that its read/unread marker is a small pastel blue dot to the left of the summary -a summary that has the sender in bold, with the first couple of lines below. Every other modern email program (Outlook, Thunderbird, Gmail, Y! Mail, live.com) uses bold to mean "unread", but no, Apple think "bold is for the sender" and "unread can be a barely visible dot to the side".

I could maybe get used to that. What I can't get used to is the way that emails on a gmail account seem to magically get deleted, even though I didn't delete them. It looks a bit like there is some auto-aging feature, but it deletes entire conversations, and does it without warning. Fortunately, very fortunately, gmail moved the messages to the "bin" folder, where I've been able to select them all and restore them to the inbox.

It's destroyed my trust in the program. If you can't rely on it not to discard conversations, you can't rely on it. At which point, it's in the do not use category.

What does that leave for the machine? Thunderbird, and, er, Outlook for OS/X. Having the latter installed, I'm considering using that with IMAP to Gmail. This could be some leftover from my time at HP; perhaps I am secretly missing large ppt-ware and MS Word documents hitting my inbox, maybe even missing the bizarre dialogs that would pop up.

My past issues with Outlook on Windows are well documented. That set of blog entries is the best argument as to why I shouldn't try Outlook on OS/X.

[photo: jamaica street; painting the lamp post to match the wall is becoming a tradition. It makes for better front-on photos if they've done it right, as the lamp post becomes invisible]


Strata EU: Logistics

I was at Strata EU last week -the first time ORA had hosted it in the EU.

Rather than go into the details, I'm going to look more at logistics. As a speaker, I got to stay in the hotel, the London Hilton Metropole, positioned where the Westway flyover rises off Edgware Road; 3.2 miles from where I grew up in West Hampstead. The hotel was very close to Paddington Station, ideally positioned for people coming from LHR or Bristol. Unfortunately, I was approaching from Portsmouth, so ended up at London Waterloo, south of the river.

A Sunday evening was the ideal time to try out my Boris Bike key and cycle over there in the half hour of free-ride time you get. I first took the footbridge over the river to Charing Cross and then crossed Trafalgar Square before starting -negotiating one of the bridges of death didn't appeal to me.

Getting the bicycle proved harder than I thought, as the key wouldn't let me pick any up; plugging it in by the touch screen brought up a page saying "call Transport for London". Which I did, above the traffic noise, and got someone who said I had yet to authenticate the key and had to do that there and then, including answering one of the security questions. Without getting the laptop out I couldn't do that, but I managed to get by without it -and at the end I was told the answer to the question, which involved Boris and some very negative phrases. They must get that a lot.

When I got the key in the post, TfL had included a nice map of central London showing all the bike rental sites. What that map didn't do was show sensible cycling routes. I could certainly get to the hotel via Regent St, Oxford St, Marble Arch and then Edgware Road -trivial routing- but not one that leaves you happy.

Instead I used the cycling layer of an OpenStreetMap viewer on my phone and meandered up through the expensive parts of Westminster, over Hyde Park and beyond -until I got fed up of repeatedly checking my location and just went up Edgware Road instead, soon to dock the bike. Some blue lines on the TfL map would have been convenient.

This was my first trip on the TfL rental bikes, and they were a surprise.
  1. They are barges with awful friction and rolling resistance. I know they are powering some blinky LED lights, but even so they are slow. The gearing doesn't help either; it goes low but its top option would be low-mid-ring on my commuter.
  2. Those blinky lights are pretty awful, especially the front one. The only way you'd be seen against the illumination of chicken fast food restaurants on the Edgware Road would be as a silhouette eclipsing the chicken broilers in the front windows. You are in darkness in Hyde Park too -these are not for nighttime MTB races.
  3. The brakes are dire too, with minimal reaction. I'd view that condition on my commuter as an emergency, not a normal state of affairs. I've realised why they are so bad: if they were set up the way mine are -light touch onto disc brakes- too many riders would be straight over the front bars as they (literally) hit that first junction. You just need to keep your speed down, especially given the inertia of the land-barge.
  4. Not a good turning circle.
Overall: not great, but they got me to where I wanted to be without going near the tube; given some time I'll learn my way around better.
The hotel was OK, except I couldn't get the wifi to work in my room, even when entering my (surname, room) info. A call to reception informed me that I actually needed to pay extra for wifi. That was like falling back in time. I almost expected them to tell me that there was a phone socket for my modem. I declined the option of room wifi and just flipped my mobile into Wifi hotspot mode to take advantage of my "unlimited" data option that I'd bought from 3 this month. Functional, albeit slow.

The room was on floor 10 -in the morning I could see the tower block near my house & from there orient myself to the trees behind, hence to the trees above. That's the closest I've been to it for 15 years. Maybe I should visit it sometime.

The next day, breakfast and conference. I found a good cafe nearby with Illy coffee and chocolate croissants -something to remember the next time I am in Paddington station waiting for a train.
The conference was fun -loitering near the booth meant I spent more time meeting other attendees than in talks -but the few I made were good. In particular, James Cheshire's talk showed some beautiful visualisations of data projected or animated onto maps of London; a talk on Cause and Effect really laid down how to do effective tests -a key point being that a negative result is still a result, so don't ignore it.

I also enjoyed Isabel Drost's talk on big data mistakes, where she got everyone to own up to getting things wrong -like creating too many small files, accidentally deleting the root tree of the FS, running jobs that bring down the cluster, etc. A lot of the examples credited someone called "Steve" -I have to own up to being this person. I consider breaking things to be an art. Indeed, I couldn't even watch her slides without having to file a bugzilla entry: https://issues.apache.org/ooo/show_bug.cgi?id=120767
cat and mouse
If there was one problem with the conference site, it was that the rooms were too scattered. After day 1 you'd learned your way around, but it still took five minutes to get to each talk -cutting each talk down by five minutes. It also stopped you running out of a talk you didn't like and going to another one. Not that I'd do that -or expect anyone in the audience of my talk to do such a thing.


Ingress and Egress

Last week someone from British Telecom/BT came round to boost my networking, running Fibre to the Cabinet and then re-enabling the existing Copper-to-the-Home from there.
Upload statistics
As the graph shows, it's got a lot faster than the Virgin cable: download has gone from 12.7 Mb/s (vs. a promised 20 Mb/s) to ~54-55 Mb/s, a 4x improvement, while upload has gone from a throttled 2 Mb/s to 15 Mb/s -7x. That 7x upload speed is what I was really looking for -both the ADSL and cable offerings are weak here, with Sky being the worst at 0.8 Mb/s, which is pretty atrocious. The cable modem offering also suffered from collapsing under load in the evening, especially when the students were back (i.e. this time of year). I don't have that problem any more.

Now I can not only download things, I can upload them. In fact, this network is now so fast that you can see other problems. As an example, the flickr uploader used to crawl through each photo upload. Now it sprints up -so much so that the per-photo fixup at the end becomes the pause in the progress bar, not a minor detail.

It's on the downloads, though, where problems arise -problems down to TCP and HTTP. HTTP likes to open a new connection on every GET/POST/PUT/HEAD/whatever operation. TCP has throttling built in to stop it flooding the network. Part of that throttling is slow start: rather than streaming data at the full rate claimed by the far end, TCP slowly ramps up its window size based on the acknowledgements coming back to it. Acknowledgements that depend on the round trip to the remote host and back -and hence on the round trip time, not the bandwidth. Even though my bandwidth has improved in both directions, the distance to remote servers and the number of hops is roughly the same -only now that slow start is visible.

Take an example: the NetFlix progress bar at the start of a video. It begins, slowly filling up. Suddenly half way along it picks up speed and fills the rest of the bar in 1 second, compared to the 4-5 seconds for the first half.

What I am seeing there is latency in action.

It shows the real difference between a 100Mb/s LAN and WAN connections at a sizeable fraction of that. A 100Mb/s LAN isn't too bad for pushing data between two boxes adjacent to each other -and the ramp-up time is negligible. Over a distance, it's latency and round trip times that make short-lived TCP operations -of which HTTP GETs are a key example- way slower than they need to be.
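A back-of-envelope sketch of the effect (assuming classic slow start: an initial window of four 1460-byte segments, doubling every RTT -real stacks vary):

```python
def rtts_to_transfer(total_bytes, init_segments=4, mss=1460):
    """Round trips needed under slow start, window doubling per RTT."""
    sent, window, rtts = 0, init_segments * mss, 0
    while sent < total_bytes:
        sent += window
        window *= 2
        rtts += 1
    return rtts

# A 100KB HTTP response needs the same number of round trips on LAN or WAN;
# only the RTT differs, so latency -not bandwidth- sets the floor.
rtts = rtts_to_transfer(100 * 1024)   # 5 round trips
lan_ms = rtts * 0.2                   # ~0.2ms RTT: the ramp is invisible
wan_ms = rtts * 100                   # ~100ms RTT: half a second before done
```

That half-second floor on a transatlantic GET is exactly the Netflix progress-bar behaviour above: the first half crawls through the ramp-up round trips, the second half arrives at full window.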

Google have a paper discussing this and arguing for increasing the initial window size. For those of us with long-but-fat-pipes, this makes sense. I don't know about all those mobile things though.


PUE, CO2 and NYT

Crater Lake Tour 2012

The NY Times has published an article "exposing" the shocking power wasted in a datacentre. It's an interesting read, even if metrics like "1.8 trillion gigabytes" take work to convert into meaningful values, which, assuming they use the HDD vendors' abuse of the values G and T in their disk specs, work out as:
"2,000 gigabytes": ~2TB
"50,000 gigabytes": ~50TB.
"roughly a million gigabytes": ~1 PB.
"1.8 trillion gigabytes": ~1.8 Exabytes.
"76 billion kilowatt hours: 76e9 KWh = 76e6 MWh = 76e3 GWh

There's already a scathing rebuttal, which doesn't say much I disagree with.

One part of the NYT article involved looking round a "datacenter" and discovering lots of unused machines, services that only get used intermittently. I'm assuming this is some kind of enterprise datacentre, a room or two set up a decade ago to host machines. Those underused machines should be eliminated; their disk image converted to a VM and then hosted under a hypervisor. Result: less floor space, CPU power and HDD momentum wasted.

Those enterprise datacentres are the ones whose PUE tends to be pretty bad -because it's mixed in with the rest of the site's aircon budget, and not as significant & visible a cost as it is for the big facilities. Google, Amazon and Facebook do care about this; they are probably the people backing the ARM-based servers, such as those running Hadoop jenkins builds. What those vendors care about tends to be cost though: cost of HW, cost of power, cost of land, cost of packets.

What the article doesn't look at -but the folks at MastodonC will presumably cover at Strata EU- is not the energy cost of computation but the CO2 cost. Those datacentres in VA, where Amazon US-East is, have awful CO2 footprints, being all coal-powered. That's why it's ironic that the NYT complains about Amazon's diesel generators being pollution -in a part of the world where mountain-top mining converts entire mountains into smoke. They'd have been better off looking at the CO2 footprint of the datacentres, and of the other industries in the area.

MastodonC's dashboard is why I'm storing data and spinning up t1.micro instances in US-West 2 -Oregon; lowest CO2 footprint of their US sites.

I was also kind of miffed at the paper's criticism of power lines "financed by millions of ordinary ratepayers". Surely freeways were also "financed by millions of ordinary ratepayers", yet the NYT has never done a shocking critique of Walmart's use of them to ship consumer goods around in fuel-inefficient diesel trucks, despite the fact that an energy-efficient alternative (electric trains) has existed for decades.

One thing the NYT does hint at is the storage cost -and hence the power cost- of old email attachments. It makes me think that I should clean some of the old junk up. What they don't pick up on is the dark secret of Youtube: the percentage of videos that are of cats. If you want someone to blame, blame the phones that make taking such videos trivial, and the people who upload them.

[Photo: Crater Lake, OR. The sky is hazy as the forest fires in Lassen and west of Redding are bringing smoke up from CA].


My Hadoop-related Speaking Schedule

I'm back from the US, where I had lots of fun getting the HA HDP-1 stuff out the door -I now know about Linux Resource Agents, and too much about Bash -though that knowledge turns out to be terrifyingly useful.

Here's a pic of me sitting outside a cabin in Yosemite Valley where we spent a couple of nights -Camp 4 wasn't on the permitted accommodation list this time.
Curry Camp Cabin, Yosemite

Some people may be thinking "cabin?", "Yosemite?" and "isn't that where all those people caught Hantavirus and died?". The answer is yes -though they were in wooden-walled tent-things about 100 metres away, and the epidemiology assessments show that even for them the risk is very small. The press like headlines such as "20,000 people may be at risk" -missing the point that the larger the set of people "present" for the same number of "ill", the smaller P(ill | present). Which is good, as P(die | ill)=0.4.
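In numbers (the visitor and case counts here are made up for illustration; only P(die | ill)=0.4 comes from the reports):

```python
def p_die_given_present(ill, present, p_die_given_ill=0.4):
    """P(die | present) = P(ill | present) * P(die | ill): for a fixed
    number of cases, a bigger 'present' population means less risk each."""
    return (ill / present) * p_die_given_ill

# 10 cases among 20,000 visitors: a 1-in-5,000 chance, not a 40% one.
risk = p_die_given_present(ill=10, present=20000)   # 0.0002
```

The headline number 20,000 is the thing that should reassure, not alarm: it sits in the denominator.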

Even so, I've had some good discussions with the family doctor and the UK Health Protection Agency, who did write a letter saying "if you show symptoms of flu within 6 weeks of visiting, get to a hospital for a blood test". As the doctor said, "we don't get many cases of Hantavirus in Bristol", so it's not something they are geared up for. You know that's the case when they start looking at the same web pages you've already read.

Well, we've got 1-2 weeks left to go. And it was excellent in Yosemite, though next time I'd stay more in Tuolumne Meadows than in the valley itself (too busy), and maybe sort out the paperwork to go back-country.


Assuming that I remain alive for the next fortnight, here is where I'm going to be speaking over the next few months.

Strata EU: Data Availability and Integrity in Apache Hadoop.

I've already done a preview of the talk at a little workshop in Bristol -the live demo of RHEL HA failover did work, so I hope to repeat it. I'll be manning the Hortonworks booth and wearing branded T-shirts, so will be findable -though I plan to attend some of the talks. In particular, one of the people behind Spatial Analysis UK will be talking -and I just love their maps.

Big Data Con London, Hadoop as a Data Refinery.

Here I'll be exploring the "Data Refinery" metaphor as a way to visualise and communicate the role of the Hadoop stack in existing organisations.

ApacheCon EU, Introduction to Hadoop-dev.

I'm going to talk about the Hadoop development process: QA, testing, contributions. This isn't going to be a basic "here's SVN" talk, or a "Hortonworks and Cloudera can handle everything" one, but one that looks at the current process -both strengths and weaknesses. As a committer who was not only on their own for some years, but is still in a different TZ, I know the problems that arise. I believe it is essential for people using Hadoop in the field to get their feedback in, through JIRA, tests & patches. If there is one thing that I think needs work, it is to have a semi-formalised process for external projects to do mentored work relating to Hadoop. That's companies, individuals, interns and university research. All too often we don't know that someone is working on a feature until they turn up with something big that cuts across the projects -and at that point it's too late to shape it, to open it up to external input, or even to comprehend it. Just as Apache has an incubator, I think we need something structured -as the alternative is that this work falls on the floor and ends up wasted.


RE: Hello from Twitter!

My life is less interesting than it seems on facebook

Yet another LinkedIn approach, this one from someone who didn't know who at Twitter was a Hadoop committer and therefore someone I knew/had consumed beer at the Highbury with.

What I will flag up here is the Twitter storage team's plan to build a new data platform from scratch. Not heard of that before.

Hi. I'm afraid I must decline your invitation. I am having lots of fun at Hortonworks building the future of Hadoop, and am not interested in any discussions.

While I am at it, can I point out

These state my reference policy for dealing with unsolicited approaches. A quick scan of the Hadoop committer list would identify who I have worked with at Twitter -and so who to approach me through more personally.

On 8/30/12 1:00 AM, Mike wrote:
Hi Steve,

I lead passive candidate engagement for the core storage team at Twitter. I ran across your profile and I wanted to see if you may be open to a short conversation.  The storage team is working on a next generation big data platform we are building from scratch.  This will fundamentally change the way we store and analyze data.  Would you be available to speak sometime this week?

I look forward to your response.





Summit Approach
I guess I should undeclare all photos of me cycling in either US Postal or T-Mobile cycling tops, such as here, where A. and I topped out the Col de la Madeleine (1993m) in 2009, having a meal at the top before descending -the first time my son had been over 40mph on a bicycle. Those Burley tagalongs corner well -the rack mount gives them a good CofG- but are not so good off road.

Regarding the TdF, I was down in Grenoble in '98, and almost drove up to Annecy to catch the stage there -I chose to head east and ride the Galibier from both sides instead. That was a good choice, as that was the day the riders held a sit-down strike: "If you make the drug tests harder, make the tour easier".

This shows the problem. The riders wanted to win -but the TV companies wanted exciting television; the little towns wanted the intermediate sprints, as did that French betting company (the PMU?); the sponsors wanted the TV coverage. EPO and blood doping were the dirty little secrets: undetectable drugs that meant that from Indurain's era onwards, all winners of the TdF probably cheated.

It's hard to blame or criticise Lance Armstrong here, because, well, that was how the game went that year -and it wasn't just the cyclists who benefited.


Me: I just regret not taking an afternoon out from Geneva in 1988 to catch LeMond.


Page Mill Hill Work, T+20

This is me atop Page Mill; photo on Ilford B&W, hand developed:
Pagemill Summit 1992

Date? Summer 1992; spending a few weeks in Santa Clara.

I've just made it up Page Mill, the bike -my original mountain bike- is carrying about 6-8kg of surplus weight on the back.

Here's the scene again, this time on a digital camera that even knows where it is:
20 years on

Date: Summer 2012; spending a few weeks in Sunnyvale.

The 6-8kg of surplus luggage has moved off the bike into my body, making it impossible for me to leave it behind on rest days; the fringe on my hair has moved back, and the sunglasses hide the fact I look more tired.

Otherwise, not much has visibly changed.

Except that the climb up Page Mill in the first photo was followed by a ride all the way up Skyline, then a drop down to the Pacific side of San Francisco, where I eventually turned up at my motel to discover that my reservation had been voided -and, because of some event in the city, if I wanted somewhere to stay I would have to sprint over to the YMCA in that fairly rough part off Market -Tenderloin?- where, before I could get a room, the people in front of me were complaining they'd been robbed through a window on the third floor. Needless to say, the bike came into the room. Estimated distance: 90-100 miles, 3000-5000' of ascent.

The next day: over into Marin, over Mt Tamalpais on an off-road trail, where I got to enjoy overtaking people who didn't have luggage, then over to Pt Reyes and the Youth Hostel there; 70-80 miles, 5000-6000' of ascent.

These days, up Page Mill then south on Skyline before descending to Cupertino and home is enough for me to declare victory (50 miles; 3000' of up), after which I can settle down and drink a beer, pretending to myself that I've earned it.

That's the difference.


Welcome to Chaos

Welcome to Bristol

Netflix have published their "Chaos Monkey" code on GitHub, ASL-licensed. I have already filed my first issue, having looked through the code -an issue that is already marked as fixed.

Netflix bring to the world the original Chaos Monkey -tested against production services.

Those of us playing with failures, reliability and availability in the Hadoop world also need something that can generate failures, though for testing the needs are slightly different:
  1. Failures triggered repeatably.
  2. More aggressive failure rates.
  3. Support for more back ends than Amazon -desktop, physical and private IaaS infrastructures.
#1 and #2 are config tuning -faster killing, seeded execution.

#3? Needs more back ends. The nice thing here is that there's very little you need to implement when all you are doing is talking to an Infrastructure Service to kill machines; the CloudClient interface has one method:
  void terminateInstance(String instanceId);
That needs to be aided with something to produce a list of instances, and of course there's the per-infrastructure configuration of URLs and authentication.
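For a new back end, then, most of the work is in building and issuing the kill operation for your infrastructure. Here is a rough Java sketch of what an SSH/pid-file back end could look like -the class name, config fields and helper method are all invented for illustration; only the `terminateInstance` signature comes from the interface above:

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a Chaos Monkey back end that "terminates" an
// instance by SSHing in and killing the process named in a pid file.
// Not the real SimianArmy API -just the shape of the work involved.
public class SshCloudClient {

    // map from instance id to hostname; in practice, loaded from a config file
    private final Map<String, String> hosts;
    private final String pidFile;

    public SshCloudClient(Map<String, String> hosts, String pidFile) {
        this.hosts = hosts;
        this.pidFile = pidFile;
    }

    // Build the remote command: kill the process whose pid is in the pid file.
    public List<String> terminationCommand(String instanceId) {
        String host = hosts.get(instanceId);
        if (host == null) {
            throw new IllegalArgumentException("Unknown instance " + instanceId);
        }
        return List.of("ssh", host, "kill -9 $(cat " + pidFile + ")");
    }

    // The single method the CloudClient interface requires.
    public void terminateInstance(String instanceId) throws IOException {
        new ProcessBuilder(terminationCommand(instanceId)).start();
    }
}
```

Everything else -enumerating targets, picking a victim, scheduling- stays in the monkey itself, which is why adding back ends is cheap.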

My colleague Enis has been doing something for this for HBase testing; independently, I've done something in Groovy for my availability work, draft package org.apache.chaos. I've done three back ends:
  1. SSH in to a machine and kill a process by pid file.
  2. Pop up a dialog telling the user to kill a machine (not so daft, good for semi-automated testing).
  3. Issue virtualbox commands to kill a VM.
All of these are fairly straightforward to migrate to the Chaos Monkey; they are all driven by config files enumerating the list of target machines, plus some back-end specific options (e.g. pid file locations, list of vbox UUIDs).
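The config-file-driven approach might look something like this (all keys and values here are invented for illustration; the real files are back-end specific):

```properties
# Hypothetical chaos config: pick a back end and enumerate its targets
chaos.backend = ssh
chaos.targets = node1.example.org, node2.example.org, node3.example.org
# SSH back end: where to find the pid of the process to kill
chaos.ssh.pidfile = /var/run/hadoop/datanode.pid
# VirtualBox back end: UUIDs of the target VMs
chaos.vbox.uuids = vm-uuid-1, vm-uuid-2
```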

Then there are the other possibilities: VMware, fencing devices on the LAN, ssh in and issue ifup/ifdown commands (though note that some infrastructures, such as vSphere, recognise that explicit option and take things off HA monitoring). All relatively straightforward.

Which means: we can use the Chaos Monkey as a foundation for testing how distributed systems -especially the Hadoop stack components- react to machine failure, across a broad set of virtual and physical infrastructures.

That I see the appeal of.

Because everyone needs a Chaos Monkey.

[update 13:24 PST, fixed first name, thank you Vanessa Alvarez!]


Reminder to #Hadoop recruiters: do your research

It was only last week that I blogged about failing Hadoop recruiter approaches.

Key take-aways were
  • Do your research from the publicly available data.
  • Use the graph, don't abuse it.
  • Never try to phone me.
Given that it is fresh in my blog, and that the blog is associated with my name and LI profile, it's disappointing to see that people aren't reading it.

Danger Vengeful God

A couple of days ago, someone tried to get me on Twitter looking for SAP testers, which is about as relevant and appealing to me as a career opportunity in marketing.

Today, some LI email from Skype.com
From: James
Date: 26 July 2012 07:35
Subject: It's time for Skype.
To: Steve Loughran

Dear Steve,

Apologies for the unsolicited nature of the message; it seemed the most confidential way to approach you, although I shall try to contact you by phone as well.

I am currently working within Skype's Talent Acquisition team and the key focus in my role is to search and hire top talent in the market place. I am currently looking at succession planning both for now but also on a longer term plan. I would be keen to have a conversation with you about potential opportunities, and introduce myself and tell you more about Skype (Microsoft) and our hiring in London around "Big Data".

I look forward to hearing from you.
See that? He's promising to try and contact me by phone, as if I'm going to be grateful. Yet last week I stated that as my "do that and I will never speak to you" policy. Nor do I consider unsolicited emails confidential -as you can see.

I despair.

The data is there, use it. If you don't, well, what kind of data mining company are you?

For anyone who does want an exciting and challenging job in the Hadoop world -one where your contributions will go back into open source and you will become well known and widely valued- can I instead recommend one of the open Hortonworks positions? We. Are. Having. Fun.

As an example, here is a scene at yesterday's first birthday BBQ:
Garden Party

Eric14, co-founder and CTO, on the left; Bikas on the right, wearing Buzz Lightyear balloons after food and beverages; nearby, Owen is wearing facepaint. Bikas is working on Hadoop on Windows, and has two large monitors showing Windows displays in his office, alongside the Mac laptop. The outcome of that work is not just that Hadoop will be a first class citizen on the Windows platform -you'll get excellent desktop and Excel integration too. Join that team and you get to play with this stuff early -and bring it to the world.
I promise I will not phone anyone up about these jobs.


Defending HDFS

It seems like everyone is picking on HDFS this week.

Limited Press @ #4 Hurlingham Road

Some possibilities
  1. There are some fundamental limitations of HDFS that suddenly everyone has noticed.
  2. People who develop filesystems have noticed that Hadoop is becoming much more popular and wish to help by contributing code, tests and documentation to the open source platform, passing on their experiences running the Hadoop application stack and hardening Hadoop to the specific failure modes and other quirks of their filesystems.
  3. Everyone whose line of business is selling storage infrastructure has realised that not only are they not getting new sales deals for Hadoop clusters, but that HDFS is making it harder to justify the prices of "Big Iron" storage.
If you look at the press releases, action two, "test and improve the Hadoop stack", isn't being done by the "legacy" DFS vendors. These are the existing filesystems that are having Hadoop support retrofitted -usually by adding locality awareness to SAN-hosted location independence, and a new filesystem driver with topology information for Hadoop. A key aid to making this possible is Hadoop's conscious decision not to support full POSIX semantics, which makes it easier to flip in new filesystems (a key one being Amazon S3's object store, which is also non-POSIX).

I've looked at NetApp and Lustre before. Whamcloud must be happy that Intel bought them this week, and I look forward to consuming beer with Eric Barton some time. I know they were looking at Hadoop integration -and have no idea what will happen now.

GPFS: well, I'll just note that they don't quote a price, instead saying "have an account team contact you". If the acquisition process involves an account team, you know it won't be cents per GB. Client-side licensing is something I thought went away once you moved off Windows Server, but clearly not.

CleverSafe. This uses erasure coding as a way of storing data efficiently; it's a bit like parity encoding in RAID, but not quite: instead of the data being written to disks with parity blocks, the data gets split up into blocks and scattered through the DFS. Reading a file involves pulling in multiple blocks and merging them. If you over-replicate the blocks you can get better IO bandwidth -grab the first ones coming in and discard the later ones.

Of course, you then end up with the bandwidth costs of pulling in everything over the network -you're in 10GbE territory and pretending there aren't locality issues, as well as worrying about bisection bandwidth between racks.

Or you go to some SAN system with its costs and limitations. I don't know what CleverSafe say here -potential customers should ask that. Some of the cloud block stores use e-coding; it keeps costs down and latency is something the customers have to take the hit on.

I actually think there could be some good opportunities to do something like this for cold data or stuff you want to replicate across sites: you'd spread enough of the blocks over 3 sites that you could rebuild them from any two, ideally.
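A toy version of the parity idea, in Java: one XOR parity block over k data blocks lets you rebuild any single lost block, RAID-5 style. Real erasure codes (Reed-Solomon and friends, which is closer to what the commercial systems use) tolerate more losses at the cost of more parity blocks and more CPU; this sketch is only meant to show why reading after a loss means pulling in the survivors over the network:

```java
// Single-parity erasure coding in miniature. All blocks must be equal length.
public class XorParity {

    // parity[i] = b0[i] ^ b1[i] ^ ... ^ b(k-1)[i]
    public static byte[] parity(byte[][] blocks) {
        byte[] p = new byte[blocks[0].length];
        for (byte[] b : blocks) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= b[i];
            }
        }
        return p;
    }

    // Rebuild the block at index `lost`: XORing the parity with every
    // surviving block cancels out everything except the missing one.
    public static byte[] rebuild(byte[][] blocks, byte[] parity, int lost) {
        byte[] out = parity.clone();
        for (int j = 0; j < blocks.length; j++) {
            if (j == lost) continue;
            for (int i = 0; i < out.length; i++) {
                out[i] ^= blocks[j][i];
            }
        }
        return out;
    }
}
```

Note the cost model this implies: a rebuild touches k-1 remote blocks to recover one, which is where the bandwidth and locality worries above come from.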

Ceph: reading the papers, it's interesting. I haven't played with it or got any knowledge of its real-world limitations.

MapR. Not much to say there, except to note the quoted Hadoop scalability numbers aren't right. Today, Hadoop is in use in clusters up to 4000+ servers and 45+PB of storage (Yahoo!, Facebook). Those are real numbers, not projections from a spreadsheet.

There are multiple Hadoop clusters running at scales of 10-40 PB clusters, as well as lots of little ones from gigabytes up. From those large clusters, we in the Hadoop dev world have come across problems, problems we are, as an open source project, perfectly open about.

This does make it easy for anyone to point at the JIRA and say "look, the namenode can't...", or "look, the filesystem doesn't..." That's something we just have to recognise and accept.

Fine: other people can point to the large HDFS clusters and say "it has limits", but remember this: they are pointing at large HDFS clusters. Nobody is pointing at large Hadoop-on-filesystem-X clusters, for X != HDFS -because there aren't any public instances of those.

All you get are proof of concepts, powerpoint and possibly clusters of a few hundred nodes -smaller than the test cluster I have access to.

If you are working on a DFS, well, Hadoop MapReduce is another use case you've got to deal with -and fast. The technical problem is straightforward -a new filesystem client class. The hard part is solving the economics problem of a filesystem that is designed not only to store data on standard servers and disks -but to do the computation there.
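To make "a new filesystem client class" concrete, here is a trimmed sketch of the contract such a class has to meet. The method names mirror operations on the real org.apache.hadoop.fs.FileSystem, but this interface and the toy in-memory implementation are mine, invented for illustration -not Hadoop code:

```java
import java.io.*;
import java.util.*;

// Illustrative subset of what a Hadoop-compatible filesystem client covers.
interface DfsClient {
    OutputStream create(String path) throws IOException;
    InputStream open(String path) throws IOException;
    boolean rename(String src, String dst);
    boolean delete(String path);
    List<String> listStatus(String dir);
    // the part a SAN retrofit has to fake up: block locality hints,
    // so the scheduler can place work "near" the data
    List<String> getFileBlockLocations(String path, long start, long len);
}

// Toy in-memory implementation, just enough to show the shape of the work.
class InMemoryDfsClient implements DfsClient {
    private final Map<String, ByteArrayOutputStream> files = new HashMap<>();

    public OutputStream create(String path) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        files.put(path, out);
        return out;
    }

    public InputStream open(String path) throws IOException {
        ByteArrayOutputStream out = files.get(path);
        if (out == null) throw new FileNotFoundException(path);
        return new ByteArrayInputStream(out.toByteArray());
    }

    public boolean rename(String src, String dst) {
        ByteArrayOutputStream data = files.remove(src);
        if (data == null) return false;
        files.put(dst, data);
        return true;
    }

    public boolean delete(String path) {
        return files.remove(path) != null;
    }

    public List<String> listStatus(String dir) {
        List<String> result = new ArrayList<>();
        for (String path : files.keySet()) {
            if (path.startsWith(dir)) result.add(path);
        }
        return result;
    }

    public List<String> getFileBlockLocations(String path, long start, long len) {
        // everything is "local" in memory; a real client returns hostnames
        return List.of("localhost");
    }
}
```

The client class really is the easy bit; note that nothing in it says anything about where the computation runs, which is the economics question.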

Any pure-storage story has to explain why you also need a separate rack or two of compute nodes, and why SAN failures aren't considered a problem.

Then they have to answer a software issue: how can they be confident that the entire Hadoop software stack runs well on their filesystem? And if it doesn't, what processes have they in place to get errors corrected -including cases where the Hadoop-layer applications and libraries aren't working as expected?

Another issue is that some of the filesystems are closed source. That may be good from their business model perspective, but it means that all fixes are at the schedule of the sole organisation with access to the source. Not having any experience of those filesystems, I don't know whether or not that is an issue. All I can do is point out that it took Apple three years to make the rename() operation atomic, and hence compliant with POSIX. Which is scary, as I do use that OS on my non-Linux boxes. And before that, I used NTFS, which is probably worse.

Hadoop's development is in the open; security is already in HDFS (remember when that was a critique? A year ago?), and HA is coming along nicely in the 1.x and 2.x lines. Scale limits? Most people aren't going to encounter them, so don't worry about that. Everyone who points to HDFS and says "HDFS can't" is going to have to find some new things to point to soon.

For anyone playing with filesystems other than hdfs://, file:// and s3://, here are some things to ask your vendor:
  1. How do you qualify the Hadoop stack against your filesystem?
  2. If there is an incompatibility, how does it get fixed?
  3. Can I get the source, or is there an alternative way of getting an emergency fix out in a hurry?
  4. What are the hardware costs for storage nodes?
  5. What are the hardware costs for compute nodes?
  6. What are the hardware costs for interconnect?
  7. How can I incrementally expand the storage/compute infrastructure?
  8. What are the licencing charges for storage and for each client wishing to access it?
  9. What is required in terms of hardware support contracts (replacement disks on site etc), and cost of any non-optional software support contracts?
  10. What other acquisition and operational costs are there?
I don't know the answers to those questions -they are things to ask the account teams. From the Hadoop perspective:
  1. Qualification is done as part of the release process of the Hadoop artifacts.
  2. Fix it in the source; convince someone else to (support contracts, etc.).
  3. Source? See http://hadoop.apache.org/
  4. Server hardware? SATA storage; servers depend on the CPU and RAM you want.
  5. Compute nodes? See above.
  6. Interconnect? Good question. 2x1GbE is getting more popular, I hear; 10GbE is still expensive.
  7. Adding new servers is easy, expanding the network may depend on the switches you have.
  8. Licensing? Not for the Open Source bits.
  9. H/W support: you need a strategy for the master nodes, inc. Namenode storage.
  10. There's support licensing (which from Hortonworks is entirely optional), and the power budget of the servers.
Server power budget is something nobody is happy about. It's where reducing the space taken up by cold data would have add-on benefits -there's a Joule/bit/year cost for all data kept on spinning media. The trouble is: there's no easy solution.

I look forward to a time in the future when solid state storage competes with HDD on a bit-by-bit cost basis, so that cold data can be moved to it -where wear levelling matters less precisely because the data is cold- and warm data can live on it for speed of lookup as well as power. I don't know when that time will be -or even if it will come.

[Artwork, Limited Press on #4 Hurlingham Road. A nice commissioned work.]