2012-04-19

Pythonium, or why is there no javash?


[Photo: Palmer Ray Solicitors]

One thing I've been doing this month is learning some basic Python. Not because I have a vast urge to switch all my code to a dynamically typed language just to apply the lambda calculus to arrays (though that appeals, obviously).

No, the reason I want to learn Python can be summarised in one word: bash.

All Java projects I've touched have tended to have some startup bash script, maybe also a Windows .BAT file if really needed. And those bash scripts suck.

They are written by Java developers who don't know or understand bash, and they contain lots of invalid assumptions and bad code. I know: I fit into that category.

They are invariably brittle in the face of spaces in paths and filenames, things not being where they should be, and environment variables not being set.

They are hard to port across platforms, because bash relies on lots of Unix programs to do the work, programs that take different arguments on Linux, MacOS and legacy SunOS platforms. (Ooh, SunOS was built on BSD, which is based on APIs from AT&T. I hope nobody involved in the lawsuit about API copyright notices that.)

They don't get tested within the Java JUnit framework world. You can do it, but it's hard, and it doesn't stress the trouble spots: environment variables, spaces in paths, different programs and arguments. This means that the various if [ ] ; tests that handle platform-specific cases don't get exercised before release.

As a result, those little startup scripts are an inordinate amount of trouble -probably the most brittle and lowest-quality bit of source within a Java project (second most if there are .BAT/.CMD scripts too).

Yet every Java project ends up having one. Why? There's no easy way to launch a Java process with the classpath set up right, with parameters passed down, and actually working.

Yes, you could double-click on a JAR file with a (relative) classpath in the manifest, but how does that pass down -Xmx16g -server -XX:+TieredCompilation -XX:+UseCompressedOops -XX:+UseNUMA -XX:+UseParallelGC -XX:+UseParallelOldGC -Dlog4j.properties=/var/app2/conf/log4j.properties? It doesn't. Hence: the shell script.

In an ideal world, you wouldn't need it. You could have some .jar-opts file living alongside the JAR file, holding all these options, one to a line. You could have that .jnlp stuff actually do something useful beyond trying to load JavaFX on demand for the three people trying to run a JavaFX-based applet.
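
For what it's worth, here is roughly what such a launcher could look like, in the Python I'm learning. A sketch only: the .jar-opts name and its one-option-per-line format are inventions of this post, not anything the JVM actually supports.

    #!/usr/bin/env python
    # Hypothetical: launch app.jar with JVM options read from app.jar-opts,
    # one option per line; blank lines and # comments are skipped.
    import os
    import subprocess
    import sys

    def read_jar_opts(jar_path):
        """Return the JVM options stored alongside the JAR, if any."""
        opts_path = jar_path + "-opts"   # e.g. app.jar -> app.jar-opts
        if not os.path.exists(opts_path):
            return []
        with open(opts_path) as f:
            lines = [line.strip() for line in f]
        return [l for l in lines if l and not l.startswith("#")]

    if __name__ == "__main__":
        if len(sys.argv) < 2:
            sys.exit("usage: jarlaunch.py app.jar [args...]")
        jar = sys.argv[1]
        # Pass a list, not a concatenated string, so paths with
        # spaces survive intact.
        command = ["java"] + read_jar_opts(jar) + ["-jar", jar] + sys.argv[2:]
        sys.exit(subprocess.call(command))

With that convention, the same options would travel alongside the JAR across platforms, and the shell script shrinks to nothing.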

Or, and this would be nice, you could have a shell runtime that compiled and executed .java files. Mark a file as executable, put #!/bin/java on the first line, and it could be compiled and executed on demand. Then you could have Java startup code written in Java, setting up the main program.
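
Until that exists, you can fake it. A minimal sketch of such a wrapper, assuming javac and java are on the PATH and the public class is named after the file (and ignoring caching, cleanup and decent error reporting):

    #!/usr/bin/env python
    # Sketch of a "javash": compile a .java file on demand, then run it.
    # The #! line has to be stripped first, since javac won't accept it.
    import os
    import subprocess
    import sys
    import tempfile

    def run_java_source(src_path, args):
        classname = os.path.splitext(os.path.basename(src_path))[0]
        with open(src_path) as f:
            source = f.read()
        if source.startswith("#!"):
            source = source.split("\n", 1)[1]   # drop the shebang line
        workdir = tempfile.mkdtemp()
        stripped = os.path.join(workdir, classname + ".java")
        with open(stripped, "w") as f:
            f.write(source)
        subprocess.check_call(["javac", "-d", workdir, stripped])
        return subprocess.call(["java", "-cp", workdir, classname] + list(args))

    if __name__ == "__main__":
        sys.exit(run_java_source(sys.argv[1], sys.argv[2:]))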

Adding a javac compiler wouldn't add much to the runtime footprint. There's already JavaFX, JavaScript, and the whole of the JRE library packages in there. A basic compiler is not that much space.

On devices with limited resources -storage, CPU- you could argue against on-demand compilation, but for everything else, just compile as needed. People shipping commercial code may worry about exposing their source, but if you are doing modern webapps, your source is already being downloaded to every browser in the .js files, and the server-side stuff stays hidden. Oh, and decompilers show us the stuff anyway.

In the OSS world, having the source on everyone's machine is a tangible benefit -it gets the source into the hands of the users, with no way of hiding it from the people who may have to take up the maintenance task in the future.

But no, there is no javash. Which is why I'm learning Python: so I have a language that isn't bash or Perl for doing the stuff that Java won't let me do, as it retains the old "compile then distribute" world view.

I don't need to learn much, just enough to write entry points that exec things or tell users off. And with ipython and the Python support in IntelliJ IDEA, that's fairly straightforward. It's certainly fun not having to wait for that compiler delay. Which raises another issue: pre-compilation may save end-user time, but it wastes developer time. As a developer, I know which I value more, at least during the dev-and-test cycle.
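
Here's the shape of the entry points I have in mind -a sketch with made-up names (APP_HOME, org.example.Main), not production code. The point is that the command line is built as a list by an ordinary function, so spaces in paths can't break it, and a unit test can actually reach it:

    #!/usr/bin/env python
    # Sketch of a Python launcher for a Java app. build_command is pure,
    # so the awkward cases -spaces in paths, unset env variables,
    # missing directories- can be unit tested before release.
    import os
    import subprocess
    import sys

    def build_command(app_home, main_class, args):
        """Assemble the java command line. Raises if the layout is wrong."""
        lib_dir = os.path.join(app_home, "lib")
        if not os.path.isdir(lib_dir):
            raise ValueError("No lib directory under %s" % app_home)
        jars = [os.path.join(lib_dir, j) for j in os.listdir(lib_dir)
                if j.endswith(".jar")]
        classpath = os.pathsep.join(jars)   # ; on Windows, : everywhere else
        # A list of arguments, never a concatenated string.
        return ["java", "-cp", classpath, main_class] + list(args)

    if __name__ == "__main__":
        app_home = os.environ.get("APP_HOME")
        if not app_home:
            sys.exit("Set APP_HOME to the application's install directory")
        try:
            command = build_command(app_home, "org.example.Main", sys.argv[1:])
        except ValueError as e:
            sys.exit(str(e))
        sys.exit(subprocess.call(command))

Compare that with a bash script, where every one of those checks is an if [ ] ; branch that never gets exercised until a user hits it.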

[Photo: Stokes Croft storefronts]

2012-04-05

Joining Hortonworks to evolve #hadoop

[Photo: Mountain Biking the Black Mountains]

I've left HP. I did that on Monday, enjoying a final beer at lunchtime with my soon-to-be-ex-colleagues, then heading home for a few weeks of parental responsibilities during the Easter break.

Later this month, I will start work at Hortonworks, pushing the Hadoop stack forwards. I am really excited about this -I know a lot of people in the company already, and it's going to be great working with them!

Although the phrase "Big Data" is getting overused, it's obvious to me that there is a real coming together of different trends to make the whole Hadoop-based ecosystem as transformational as web servers were.
  1. There are so many devices in the modern world acting as data sources -physical devices such as mobile phones and jet engines, services such as web applications, and people making use of those devices and services.
  2. In the past, less data was generated -and it was normally thrown away. Too expensive to store, no perceived value.
  3. The cost per TB of HDD storage has fallen to the point where you can now afford to keep that data for later analysis.
  4. You can't analyse it on single servers, as the bandwidth of HDDs hasn't increased at the same rate as their storage capacity (see the back-of-the-envelope sketch after this list).
  5. The performance of a single CPU has effectively topped out too. All that is coming is more cores, more operations/joule (hopefully), and different forms of parallel computation. The free speedups that the CPU vendors used to dish out are over. It's either single-machine parallelism or multi-machine. Oh, and either way: heterogeneity of some form or other.
  6. That means everyone is going to have to embrace parallel computing, on the single machine or in the rack -and with the right algorithms, that rack can be made to deliver linear and sometimes superlinear speedup.
  7. If you want to work with the big datasets that you can collect today, you are going to need a rack of servers and a framework to let you process the data.
  8. The Hadoop platform provides the framework to store the data across those hard disks and to distribute the work across them. It is becoming the single open-source alternative to Google's internal platform.
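
To put rough numbers on point 4 (my figures, for 2012-era commodity kit, so treat them as illustrative only):

    # Back-of-the-envelope: time to read one full disk, sequentially.
    capacity_tb = 3.0        # a commodity 3 TB drive
    read_mb_per_s = 100.0    # optimistic sustained sequential read rate
    hours = capacity_tb * 1e6 / read_mb_per_s / 3600
    print("%.1f hours to scan one disk" % hours)   # ~8.3 hours
    # Spread the same data over 100 disks in a rack and read in parallel:
    # the scan drops to about five minutes. That is the Hadoop proposition.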

Where the future gets really interesting is that the Hadoop ecosystem provides the core services of a distributed computing platform: bulk storage (HDFS), scheduling (MRv2), distributed state (ZooKeeper), and integration with existing infrastructure (Flume, Sqoop, Hive). These services can be used to build applications in and above Hadoop -HBase and Giraph are key examples; Cassandra a welcome friend. Big Data is the immediate reason to move into this world, but ultimately it's Big Datacentre -not things like Java EE 7, which just seems, well, so very last-century.

That's why I'm joining Hortonworks -to go full time on building the future platform for server-side computing.

[photo: preparing to descend into Crickhowell, Wales, 2011]