Greylisting - like blacklisting only more forgiving

How not to fix a car

Reading the paper  The φ Accrual Failure Detector has made me realise something that I should have recognised before: blacklisting errant nodes it too harsh -we should be assigning a score to them and then ordering them based on perceived reliability, rather than a simple reliable/unreliable flag.

In particular: the smaller the cluster, the more you have to make do with unreliable nodes. It doesn't matter if your car is unreliable, if it is all you have. You will use it, even if it means you end up trying to tape up an exhaust in a car park in Snowdonia, holding the part in place with a lead acid battery mistakenly placed on its side.

Similarly, on a 3-node cluster, if you want three region servers on different nodes, you have to accept that they all get in, even if sometimes unreliable.

This changes how you view cluster failures. We should track the total failures over time, and some weighted moving average of recent failures -the latter to give us a score of unreliability, giving us a reliability score of 1-reliability, assuming I can normalise unreliability to a floating point value in the range 0-1.

When specifically requesting nodes, we only ask for those with a recent reliability over a threshold; when we get them back we first sort for reliability and try to allocate all role instances to the most reliable nodes (sometimes YARN gives you more allocations than you asked for). We may have some allocations on nodes > the reliability threshold.
That threshold will depend on cluster size -we need to tune that based on the cluster size provided by the RM (issue: does it return current cluster size or maximum cluster size).

What to do with allocations above the threshold?
  1. discard them, ask for a new instance immediately: high risk of receiving the old one again
  2. discard them, wait, then ask for a new instance: lower risk.
  3. ask for a new instance before discarding the old one the soonest of (when the new allocation comes in, some time period after making the request). This probably has the lowest risk precisely because if there is capacity in the cluster we can't get that old container, we'll get a new one on an arbitrary node. If there isn't capacity, when we release the container some time period after making the request, we get it back again. That delayed release is critical to ensuring we get something back if there is no space.
What to do if we get the same host back again? Maybe just take what we are given, especially in case #3 and we know that the container was released after a timeout. It'll be above the threshold, but let's see what happens -it may just be that now it works (Some other service blocking a port has finished, etc). And if not, it gets marked as more unreliable.

If we do start off giving all nodes a reliability of under 100%, then we can even distinguish "unknown" from "known good" and "known unreliable". This gives applications a power they don't have today -a way to not trust the as-yet-unknown parts of a cluster

 If using this for HDD monitoring, I'd certainly want to consider brand new disks as less than 100% reliable at first, and try to avoid storing data in >1 drive below a specific reliability threshold, though that just makes block placement even more complex

I like this design --I just the need the relevant equations


Hoya as an architecture for YARN apps


Someone -and I won't name them- commented on my proposal for a Hadoop Summit EU talk, Secrets of YARN development: "I am reading YARN source code for the last few days now and am curious to get your thoughts on this topic - as I think HOYA is a bad example (sorry!) and even the DistributedShell is not making any sense."

My response: I don't believe that DShell is a good reference architecture for a YARN app. It sticks all the logic for the AM into the service class itself, doesn't do much on failures, avoids the whole topics of RPC and security. It introduces the concepts but if you start with it and evolve it, you end up with a messy codebase that is hard to test -and you are left delving into the MR code to work out how to deal with YARN RM security tokens, RPC service setup, and other details that you'd need to know in production

Whereas Hoya
  • Embraces the service model as the glue to building a more complex application. Shows my SmartFrog experience in building workflows and apps from service aggregation.
  • Completely splits the model of the YARN app from the YARN-integration layer, producing a model-controller design. Where the model can be tested independently of YARN itself.
  • Provides a mock YARN runtime to test some aspects of the system --failures, placement history, best-effort placement-history reload after unplanned AM failures --and lays the way for simulating the model can handle 1000+ clusters.
  • Contains a test suite that even kills HBase masters and Region Servers to verify that the system recovers.
  • Implements the secure RPC stuff that Dshell doesn't and which isn't documented anywhere that I could find.
  • Bundles itself up into a tarball with a launcher script -it does not rely on Hadoop or YARN being installed on the client machine.
So yes, I do think Hoya is a good example

Where it is weak is
  1. It's now got too sophisticated for an intro to YARN.
  2. I made the mistake of using protobuf for RPC which is needless complexity and pain. Unless you really, really want interop and waste a couple of days implementing marshalling code I'd stick to the classic Hadoop RPC. Or look at Thrift.
  3. I need to revisit and cleanup of bits of the client side provider/template setup logic.
  4. We need to implement anti-affinity by rejecting multiple assignments to the same host for non-affine roles.
  5. It's pure AM-side, starting HBase or Accumulo on the remote containers, but doesn't try hooking the containers up to the AM for any kind of IPC.
  6. We need to improve its failure handling with more exponential backoff, moving average blacklisting and some other details. This is really fascinating, and as Andrew Purtell pointed me at phi-accrual failure detection, is clearly an opportunity to some interesting work.
I'd actually like to pull out the mock YARN stuff out for re-use --same for any blacklisting code written for long-lived apps.

I also filed a JIRA "rework DShell to be a good reference design", which means implement the MVC split and add a secure RPC service API to cover that topic.

Otherwise: have a look at the twill project in incubation. If someone is going to start writing a YARN app, I'd say: start there. 


My policy on open source surveys: ask the infrastructure, not the people

An email trickling into my inbox reminds me to repeat my existing stance on requests to complete surveys about open source software development: I don't do them.


The availability of the email address of developers in OSS projects may make people think  that they could gain some insight by asking those developers questions as part of some research project, but consider this
  1. You won't be the first person to have thought of this -and tried to conduct a survey.
  2. The only people answering your survey will be people who either enjoy filling in surveys, or who haven't been approached, repeatedly before.
  3. Therefore your sample set will be utterly unrealistic, consisting of people new to open source (and not yet bored of completing surveys), or who like filling in surveys.
  4. Accordingly any conclusions you come to could be discounted based on the unrepresentative, self-selecting sample set.
The way to innovate in understanding open source projects -and so to generate defensible results-  is to ask the infrastructure: the SCM tools, the mailing list logs, the JIRA/bugzilla issue trackers. There are APIs for all of this.

Here then are some better ideas than yet-another-surveymonkey email to get answers whose significance can be disputed:
  1. Look at the patch history for a project and identify the bodies of code with the highest rate of change -and the lowest. Why the differences? Is the code with the highest velocity the most unreliable, or merely the most important?
  2. Look at the stack traces in the bug reports. Do they correlate with the modules in (1)?
  3. Does the frequency of stack traces against a source module increase after the patch to that area ships? or does it decrease? That is, do patches actually reduce the #of defects, or as Brooks said in the Mythical Man Month, simply move around. 
  4. Perform automated complexity analysis  on source. Are the most complex bits the least reliable? What is their code velocity?
  5. Is the amount of a discussion on a patch related to the complexity of the destination or the code in the patch?
  6. Does that complexity of a project increase of decrease over time?
  7. Does the code coverage of a project increase or decrease over time?
See? Lots of things you could do -by asking the machines. This is the data-science way, not asking surveys against a partially-self-selecting set of subjects and hoping that it is in some way representative of the majority of open source software projects and developers.

[photo: ski lifts in the cloud, Austria, december 2013]