2015-10-28

Concept: Maintenance Debt


We need a new S/W dev term: Maintenance Debt.

Search for the term and it only crops up in the phrase Child Maintenance Debt; we need it for software.

OR Mini Loop Tour

Maintenance Debt: the Technical Debt a project takes on when they fork an OSS project.

All OSS code offers you the freedom to make a fork of the code; the right to make your own decisions as to the direction of the code. Git makes that branching a one-line action: git checkout -b myfork.

Once you fork, you have taken on the Maintenance Debt of that fork. You now have to:
  1. Build that code against the dependencies of your choice.
  2. Regression test that release.
  3. Decide which contributions to the main branch are worth cherry-picking into your fork.
  4. If you do want to cherry-pick, adapt those patches to your fork. The more you change your fork, the higher the cost of adaptation.
  5. Features you add to your branch which you don't contribute back become yours to maintain forever.
  6. If the OSS code adds a similar feature to yours, you are faced with the choice of adapt or ignore. Ignore it and your branch is likely to diverge fast and cherry-picking off the main branch becomes near-impossible.
That's what Maintenance Debt is, then: the extra work you add when deciding to fork a project. Now, how to keep that cost down?
  1. Don't fork.
  2. Try and get others to add the features for you into the OSS releases.
  3. As people add the features/fixes you desire, provide feedback on that process.
  4. Contribute tests to the project to act as regression tests on the behaviours you need. This is a good one: with the tests inside the project, its patch and Jenkins builds will catch problems early, rather than you finding them later.
  5. If you do write code, fixes and features, work to get them in. It's not free; all of us have to put in time and effort to get patches in, but you do gain in the long term.
  6. Set up your Jenkins/CI system so that you can do builds against nightly releases of the OSS projects you depend on (Hadoop publishes snapshots of branch 2 and trunk for this). Then complain when things break.
  7. Test beta releases of your dependencies, and the release candidates, and complain when things break. If you wait until the x.0 releases, the time to get a fix out is a lot longer —or worse, someone can declare that a feature is now "fixed" and cannot be reverted.

If you look at that list, testing crops up a lot. That's because compile and test runs are how you find regressions. Even if you offload the maintenance debt to others, validating that their work meets your needs is a key thing to automate.
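
As a rough sketch of what such a "behaviours I need" regression test can look like, here is a check with plain assertions. The dependency here is just java.net.URI standing in for whatever project you actually depend on, and the object name, URI and messages are all made up for illustration.

```scala
import java.net.URI

object RegressionSketch {
  // Stand-in for "a behaviour we rely on in a dependency":
  // how a filesystem-style URI is split into its parts.
  def check(): Unit = {
    val uri = new URI("hdfs://namenode:8020/user/data/part-0000")
    assert(uri.getScheme == "hdfs", "scheme parsing changed")
    assert(uri.getHost == "namenode", "host parsing changed")
    assert(uri.getPort == 8020, "port parsing changed")
    assert(uri.getPath == "/user/data/part-0000", "path parsing changed")
  }
}
```

Run something like this against nightly or release-candidate builds of the dependency, and a change in behaviour shows up as a failed assertion rather than a production surprise.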

Get your regression testing in early.

[photo: looking west to the Coastal Range, N/W of Rickreall, Willamette Valley, OR, during a 4 day bike tour w/ friends and family]

2015-10-14

Scalene

Stokes Croft Graffiti, Sept 2015

I've been a busy bunny writing what has grown into a fairly large Spark patch: SPARK-1537, integration with the YARN timeline server. What starts as straightforward "POST event, GET event list, GET event" code grows once you start taking into account Kerberos, transient failures of the endpoints, handling of unparseable events (fail? Or skip that one?), and compatibility across versions. Oh, and testing all of this: I've got tests which spin up the YARN ATS and the Spark History server in the same VM, then either generate an event sequence and verify it all works or replay some real application runs.

And in the process I have learned a lot of Scala and some of the bits of Spark.

What do I like?
  • Type inference. And not the pretend inference of Java 5 or Groovy.
  • The match/case mechanism. This maps nicely to the SML case mechanism, with the bonus of being able to add conditions as filters (a la Erlang).
  • Traits. They took me a while to understand, until I realised that they were just C++ mixins with a structured inheritance/delegation model. And once so enlightened, using them became trivial. For example, in some of my test suites, the traits you mix in define which services get brought up for the test cases.
  • Lists and maps as primary language structures. Too much source is frittered away in Java creating those data structures.
  • Tuples. Again, why exclude them from a language?
  • Getting back to functional programming. I've done it before, see.
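
A small sketch pulling a few of those together: match/case with guards, traits as mixins, tuples and type inference. None of this is from the Spark patch; the Event types, the Service traits and every other name are invented for illustration.

```scala
object LikesSketch {
  // Hypothetical event types, loosely in the spirit of "post event / get event".
  sealed trait Event
  case class Posted(id: Int, body: String) extends Event
  case class Failed(id: Int, retries: Int) extends Event

  // match/case, with guard conditions acting as filters on the patterns.
  def describe(e: Event): String = e match {
    case Posted(id, body) if body.isEmpty => s"event $id: empty"
    case Posted(id, _)                    => s"event $id: ok"
    case Failed(id, n) if n > 3           => s"event $id: giving up"
    case Failed(id, _)                    => s"event $id: retry"
  }

  // Traits as mixins: each trait mixed in adds a service to the startup list.
  trait Service { def start(): List[String] = Nil }
  trait WithTimeline extends Service {
    override def start(): List[String] = super.start() :+ "timeline"
  }
  trait WithHistory extends Service {
    override def start(): List[String] = super.start() :+ "history"
  }
  class TestSuite extends Service with WithTimeline with WithHistory

  // Type inference and tuples: no types declared on the local values,
  // and two counts returned as a pair.
  def summary(events: List[Event]): (Int, Int) = {
    val posted = events.count(_.isInstanceOf[Posted])
    (posted, events.size - posted)
  }
}
```

The order of the "with" clauses matters: super.start() walks back through the mixins, so the services come up in the order they were mixed in.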

What am I less happy about?
  • The Scala collections model. Too much, too complex.
  • The fact that it isn't directly compatible with Java lists and maps. Contrast with Groovy.
  • ScalaTest. More the runner than the tests: the ability to use arbitrary strings to name a test case means that I can't run (at least via Maven) a specific test case within a class/suite by name. Instead I've been reduced to commenting out the other tests, which is fundamentally wrong.
  • I think it's gone overboard on various features...it has the, how do I say it, C++ feel.
  • The ability to construct operators using all the symbols on the keyboard may lead to code less verbose than Java, but when you are learning the specific classes in question, it's pretty impenetrable. Again, I feel C++ at work.
  • Having to look at some SBT builds. Never use "Simple" in a spec; it's as short-term as "Lightweight" or "New". I think I'll use "Complicated" in the next thing I build, to save time later.
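
To make the Java collections complaint concrete, here is a minimal sketch of the explicit conversion step, using the standard library's JavaConverters. That is the 2015-era import; newer Scala versions prefer scala.jdk.CollectionConverters. The object and method names and the sample data are made up.

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

object InteropSketch {
  // A Java list, as a Java or Hadoop API call might hand it back.
  def javaNames(): java.util.List[String] = {
    val names = new JArrayList[String]()
    names.add("trunk")
    names.add("branch-2")
    names
  }

  // asScala wraps the Java list; toList copies it into an immutable Scala List.
  def upperNames(): List[String] = javaNames().asScala.toList.map(_.toUpperCase)

  // The other direction: asJava exposes a Scala Map as a java.util.Map.
  def asJavaMap(m: Map[String, Int]): java.util.Map[String, Int] = m.asJava
}
```

It works, but every boundary with a Java API needs one of these conversions, which is the friction being grumbled about.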
Now, what's it like going back to doing some Java code? What do I miss?
  • Type inference. Even though its type inference is a kind that Milner wouldn't approve of, it's better than not having one.
  • Semicolons being mostly optional. It's so easy to keep writing broken code.
  • val vs var. I know, Java has "final", but it's so verbose we all ignore it.
  • Variable expansion in strings. Especially debugging ones.
  • Closures. I look forward to the leap to Java 8 coding there.
  • Not having to declare exceptions. Note that in Hadoop we tend to say "throws IOException", which is a slightly less blatant way of saying everything "throws Exception". We have to consider Java's explicit exception naming an idea not to repeat, on the grounds that it makes maintenance a nightmare and precludes different implementations of an interface from having (explicitly) different failure modes.
You switch back to Java and the code is verbose —and a lot of that is due to having to declare types everywhere, then build up lists and maps one by one, iterating over them equally slowly. Again, Java 8 will narrow some of that gap.
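
For comparison, a short sketch of the Scala side of that list: a map literal, val vs var, a closure and string interpolation. The service names and port numbers are purely illustrative values, and the object is invented for the example.

```scala
object MissedSketch {
  // A map built in one expression, with its type inferred.
  val ports = Map("namenode" -> 8020, "timeline" -> 8188)

  // val cannot be reassigned; var can. The closure captures the local var.
  def total(): Int = {
    var sum = 0
    val add = (n: Int) => sum += n
    ports.values.foreach(add)
    sum
  }

  // Variable expansion in strings through the s interpolator,
  // handy for debug messages.
  def debugLine(service: String): String = {
    val port = ports.getOrElse(service, -1)
    s"service=$service port=$port known=${ports.contains(service)}"
  }
}
```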

When I go back to Java, what don't I miss?
  • A compiler that crawls. I don't know why it is so slow, but it is. I think the sheer complexity of the language is a likely cause.
  • Chained over-terseness. Yes, I can do a t.map.fold.apply chain in Spark, but when you see a stack trace, having one action per line makes trying to work out what went wrong possible. It's why I code that way in Java, too. That said, I find myself writing more chained operations, even at the cost of stack-trace debuggability. Terseness is corrupting.
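
The trade-off, sketched on a plain Scala List rather than a Spark RDD. The input lines and the parse helper are hypothetical, there only to give the chain something to do.

```scala
object ChainSketch {
  val lines = List("a=1", "b=2", "c=x")

  // Hypothetical helper: treats anything unparseable as 0.
  def parse(s: String): Int =
    try s.toInt catch { case _: NumberFormatException => 0 }

  // Chained: compact, but every step shares one source line in a stack trace.
  def chained: Int = lines.map(_.split("=")(1)).map(parse).foldLeft(0)(_ + _)

  // One action per line: each step gets its own line number when it fails.
  def stepwise: Int = {
    val values = lines.map(_.split("=")(1))
    val numbers = values.map(parse)
    numbers.foldLeft(0)(_ + _)
  }
}
```

Both compute the same thing; the difference only shows up when something throws and you have to work out which stage did it.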
One interesting thing is that even though I've done big personal projects in Standard ML, we didn't do team projects in it. While I may be able to sit down and talk about correctness proofs of algorithms in functional programming, I can't discuss maintenance of someone else's code they wrote in a functional language, or how to write something for others to maintain.

Am I going to embrace Scala as the one-and-true programming language? No. I don't trust it enough yet, and I'd need broader use of the language to be confident I was writing things that were the right architecture.

What about Scala as a data-engineering language? One stricter than Python, but nimble enough to use in notebooks like Zeppelin?

I think from a pure data-science perspective, I'd say "Work at the Python level". Python is the new SQL: something which, once learned, can be used broadly. Everyone should know basic Python. But for that engineering code, where you are hooking things up, mixing in existing Java libraries and Hadoop API calls, and using things like Spark's RDDs and DataFrames, Scala works pretty well.