2017-09-09

Stocator: A High Performance Object Store Connector for Spark


(photo: Behind Picton Street)

IBM have published a lovely paper on their Stocator 0-rename committer for Spark.

Stocator is:
  1. An extended Swift client
  2. Magic in their FS client to redirect the mkdir and file PUT/HEAD/GET calls under the normal MRv1 _temporary paths to new paths in the dest dir
  3. Generating dest/part-0000 filenames using the attempt & task attempt ID to guarantee uniqueness and to ease cleanup: restarted jobs can delete the old attempts (see the sketch after this list)
  4. Commit performance comes from eliminating the COPY, which is O(data),
  5. And from tuning back the number of HTTP requests (probes for directories, mkdir 0 byte entries, deleting them)
  6. Failure recovery comes from the explicit names of output files. (Note: this avoids any saving of shuffle files, which the scheme wouldn't work with... Spark can keep those in memory)
  7. They add summary data in the _SUCCESS file to list the files written & so work out what happened (though they don't actually use this data, instead relying on their swift service offering list consistency). (I've been doing something similar, primarily for testing & collection of statistics).
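
A rough sketch of the naming idea in item 3 (mine, not Stocator's actual code): embed the Hadoop task attempt ID in the object name so that every attempt writes a unique object, and stale attempts can be found and deleted later. The helper name and format string are hypothetical.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Hypothetical helper, not Stocator's real implementation: compose an output
// object name under the destination directory which embeds the task attempt ID.
// Each attempt writes its own object, restarted jobs can delete the objects of
// earlier attempts, and no rename/COPY is needed to "promote" the output.
def uniquePartName(dest: Path, context: TaskAttemptContext, partition: Int): Path = {
  val attemptId = context.getTaskAttemptID   // e.g. attempt_201709090901_0004_m_000002_1
  new Path(dest, f"part-$partition%05d-$attemptId")
}
```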

Page 10 has their benchmarks, all of which are against an IBM storage system, not real Amazon S3 with its different latencies and performance.

Table 5: Average run time (seconds)

| | Read-Only 50GB | Read-Only 500GB | Teragen | Copy | Wordcount | Terasort | TPC-DS |
|---|---|---|---|---|---|---|---|
| Hadoop-Swift Base | 37.80±0.48 | 393.10±0.92 | 624.60±4.00 | 622.10±13.52 | 244.10±17.72 | 681.90±6.10 | 101.50±1.50 |
| S3a Base | 33.30±0.42 | 254.80±4.00 | 699.50±8.40 | 705.10±8.50 | 193.50±1.80 | 746.00±7.20 | 104.50±2.20 |
| Stocator | 34.60±0.56 | 254.10±5.12 | 38.80±1.40 | 68.20±0.80 | 106.60±1.40 | 84.20±2.04 | 111.40±1.68 |
| Hadoop-Swift Cv2 | 37.10±0.54 | 395.00±0.80 | 171.30±6.36 | 175.20±6.40 | 166.90±2.06 | 222.70±7.30 | 102.30±1.16 |
| S3a Cv2 | 35.30±0.70 | 255.10±5.52 | 169.70±4.64 | 185.40±7.00 | 111.90±2.08 | 221.90±6.66 | 104.00±2.20 |
| S3a Cv2 + FU | 35.20±0.48 | 254.20±5.04 | 56.80±1.04 | 86.50±1.00 | 112.00±2.40 | 105.20±3.28 | 103.10±2.14 |

The S3a is the 2.7.x version, which has stabilised enough to be usable with Thomas Demoor's fast output stream (HADOOP-11183). That stream buffers in RAM & initiates the multipart upload once the block size threshold is reached. Provided you can upload data faster than you run out of RAM, it avoids the long waits at the end of close() calls, so has a significant speedup. (The fast output stream has evolved into the S3ABlockOutputStream (HADOOP-13560), which can buffer off-heap and to HDD, and which will become the sole output stream once the great cruft cull of HADOOP-14738 goes in.)

That means that in the paper, "FU" == fast upload == incremental upload & RAM storage. The default for S3A will become HDD storage, as unless you have a very fast pipe to a compatible S3 store, it's easy to overload the memory.
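
For reference, a sketch of turning that fast upload path on from Spark. The property names are as I remember them for Hadoop 2.7/2.8; check the documentation of the release you actually run, and the app name is just a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-fast-upload").getOrCreate()
val hconf = spark.sparkContext.hadoopConfiguration

// Incremental multipart upload: the "FU" column in the tables above.
hconf.set("fs.s3a.fast.upload", "true")
// With the S3ABlockOutputStream (HADOOP-13560) the buffering mechanism is
// selectable: "disk", "array" (on-heap) or "bytebuffer" (off-heap).
hconf.set("fs.s3a.fast.upload.buffer", "disk")
// Size threshold at which a buffered block is uploaded as a multipart part.
hconf.set("fs.s3a.multipart.size", "67108864")  // 64 MB
```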

Cv2 means the MRv2 commit algorithm, the one which does a single rename operation on task commit (here the COPY), rather than one rename in task commit to promote that attempt and then another in job commit to finalise the entire job. So: only one copy of every byte PUT, rather than two, and the COPY calls can run in parallel, often off the critical path.
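
In Spark terms, selecting that algorithm is just a Hadoop configuration switch on the classic FileOutputCommitter; a minimal sketch (app name hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("v2-committer").getOrCreate()

// "Cv2": the FileOutputCommitter v2 algorithm. Task commit moves each file
// straight into the destination, so every byte is copied once (not twice) and
// the per-file COPY calls run inside task commits, in parallel across tasks.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
```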

Table 6: Workload speedups when using Stocator

| | Read-Only 50GB | Read-Only 500GB | Teragen | Copy | Wordcount | Terasort | TPC-DS |
|---|---|---|---|---|---|---|---|
| Hadoop-Swift Base | x1.09 | x1.55 | x16.09 | x9.12 | x2.29 | x8.10 | x0.91 |
| S3a Base | x0.96 | x1.00 | x18.03 | x10.33 | x1.82 | x8.86 | x0.94 |
| Stocator | x1 | x1 | x1 | x1 | x1 | x1 | x1 |
| Hadoop-Swift Cv2 | x1.07 | x1.55 | x4.41 | x2.57 | x1.57 | x2.64 | x0.92 |
| S3a Cv2 | x1.02 | x1.00 | x4.37 | x2.72 | x1.05 | x2.64 | x0.93 |
| S3a Cv2 + FU | x1.02 | x1.00 | x1.46 | x1.27 | x1.05 | x1.25 | x0.93 |


Their TPC-DS benchmarks show that Stocator & Swift are slower than Hadoop 2.7 S3A + fast upload & MRv2 commit. Which means that (a) the Hadoop Swift connector underperforms and (b) with fadvise=random and columnar data (ORC, Parquet), that speedup alone will give better numbers than Swift & Stocator. (It also shows how much the TPC-DS benchmarks are IO heavy rather than output heavy the way the tera-x benchmarks are.)
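
The random-IO policy mentioned there is, as far as I recall, a single switch on the S3A client in Hadoop 2.8+; treat the property name as an assumption for other versions:

```scala
import org.apache.spark.sql.SparkSession

val hconf = SparkSession.builder().getOrCreate()
  .sparkContext.hadoopConfiguration

// Optimise the S3A input stream for the seek-heavy reads of ORC/Parquet:
// short ranged GETs rather than streaming the rest of the object on every seek.
hconf.set("fs.s3a.experimental.input.fadvise", "random")
```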

As the co-author of that original Swift connector, then, what the IBM paper is saying is "our zero-rename commit just about compensates for the functional but utterly underperformant code Steve wrote in 2013, and gives us equivalent numbers to the 2016 FS connectors by Steve and others, before they started the serious work on S3A speedup". Oh, and we used some of Steve's code to test it, removing the ASF headers.

Note that as the IBM endpoint is neither the classic Python OpenStack Swift nor Amazon's real S3, it won't exhibit the real issues those two have. Swift has the worst update inconsistency I've ever seen (i.e. repeatable whenever I overwrote a large file with a smaller one), and aggressive throttling, even of the DELETE calls in test teardown. AWS S3 has its own issues, not just in list inconsistency, but serious latency of HEAD/GET requests, as they always go through the S3 load balancers. That is, I would hope that IBM's storage offers significantly better numbers than you get over long-haul S3 connections. Although it'd be hard (impossible) to do a consistent test there, I'd fear in-EC2 performance numbers to actually be worse than those measured here.

I might post something faulting the paper, but maybe I should do a benchmark of my new committer first. For now though, my critique of both the swift:// and s3a:// clients is as follows.

Unless the storage service guarantees consistency of listings along with other operations, you can't use any of the MR commit algorithms to reliably commit work. So performance is moot. Here IBM do have a consistent store, so you can start to look at performance rather than just functionality. And as they note, committers which work with object store semantics are the way to do this: for operations like this you need the atomic operations of the store, not mocked operations in the client.

People who complain about the performance of using Swift or S3A as a destination are blissfully unaware of the key issue: the risk of data loss due to inconsistencies. Stocator solves both issues at once.

Anyway, it means we should be planning a paper or two on our work too, maybe even start by doing something about random IO and object storage, as in "what can you do for and in columnar storage formats to make them work better in a world where a seek() + read() is potentially a new HTTP request?"
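
To make that concrete, here's a minimal sketch (path and read sizes hypothetical) of the access pattern a columnar reader generates against an object store: each seek + read can translate into another ranged HTTP GET, first for the footer at the tail of the file, then for whichever stripes/pages the query needs.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
val path = new Path("s3a://bucket/warehouse/table/part-00000.orc")  // hypothetical
val fs = path.getFileSystem(conf)
val len = fs.getFileStatus(path).getLen

val in = fs.open(path)
val footer = new Array[Byte](16 * 1024)
in.readFully(len - footer.length, footer)   // tail of the file: one ranged GET
val stripe = new Array[Byte](256 * 1024)
in.readFully(0L, stripe)                    // jump back to the start: potentially another GET
in.close()
```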

(picture: parakeet behind Picton Street)