2017-02-15

Towards a doctrine of the Zero Day

The Stuxnet/Olympic Games malware is awesome and the engineering teams deserve respect. There, I said it. The first in-the-field sighting of a mil-spec virus puts the mass-market toys to shame. It is the difference between the first amateur rockets and the V1 cruise and V2 ballistic missiles launched against the UK in WWII. It also represents that same change in warfare.

V1 cruise missile and V2 rocket

I say this having watched the documentary Zero Days about nation-state hacking. One thing I like about it is its underdramatization of the coders. Gone are the clichéd angled shots of the hooded faceless hacker coding in darkness to a bleeping text prompt on a screen that looks like something from The Matrix. Instead: offices with fluorescent lights compensating for the fact that the only people allocated windows are managers. What Matrix-esque screenshots there were contained x86 assembly code in the font of IDA, showing asm snippets accurate enough to give me flashbacks to when I wrote Win32/C++ code. Add some music and coffee mugs and it'd start to look like the real world.

The one thing they missed out on is the actual engineering; the issue tracker, with OLYMPIC-342, "doesn't work with Farsi version of Word", being the topic of the standup; the monthly regression test panic when Windows or Flash updates shipped and everyone feared the upgrade had fixed the exploits. Classic engineering, hampered by the fact that the end users would never send stack traces. Even determining if your code worked in production would depend on intermittent status reports from the UN or order numbers for new parts from down the centrifuge supply chain. Let's face it: even getting the test hardware must have been an epic achievement of its own.

Because Olympic Games was not just a piece of malware using multiple zero days and stolen driver certificates to gain admin access on gateway systems before jumping the airgap over USB keys and then slowly sabotaging the Iranian centrifuges. It was evidence that the government(s) behind it had decided that cyber-warfare (a term I really hate) had moved from a theoretical "look, this uranium stuff has energy" to the strategic "let's call this the Manhattan Project".

And it showed that they were prepared to apply their work against a strategic asset of another country, during peacetime. And that they had a larger program, Nitro Zeus, intended to be the opening move of a war with Iran.

As with those missiles and their payloads, the nature of war has been redefined.

In Churchill's epic five-volume history of WWII, he talks about the D-Day landings, and how he wanted to watch them from a destroyer, but was blocked by King George: "you are too valuable". Churchill wrote that everyone on those beaches felt that they were too valuable to be there too -and that the people making the decisions should be there to see the consequences of them. He shortly thereafter goes on to discuss the first V1 attacks on London, discussing their morality. He felt that the "war-head" (a new word) was too indiscriminate. He was right -but given this was 14 months ahead of August 1945, his morality didn't run that deep. Or the V1 and V2 bombings had convinced him that it was the future. (Caveat: I've ignored RAF Bomber Command as it would only complicate this essay.)

Eric Schlosser's book, Command and Control, discusses the post-war evolution of defence strategy in a nuclear age, and how nuclear weapons scared the military. Before: a thousand bombers to destroy a city like Hamburg or Coventry. Now only one plane had to get through the air defences, and the country had lost. Which changed the economics and logistics of destroying nearby countries. The barrier to entry had just been reduced.

The whole strategy of Mutually Assured Destruction evolved there, which, luckily for us, managed to scrape us through to the twenty-first century: to now. But that doctrine wasn't immediate, and even then, the whole notion of tactical vs. strategic armaments skirted around the fact that once the first weapons went off over Germany or Korea, things were going to escalate.

Looking back though, you can see those step changes in technology, and how the leading-edge technologies of each war enabled the doctrine of the next. The US Civil War: rifles, machine guns, ironclad naval vessels, the first wire obstacles on the battlefield. WWI: the trenches with their barbed wire and machine guns; planes and tanks the new tech, radio the emergent communications alongside those telegraphs issuing orders to "go over the top!". WWII and Blitzkrieg: built around planes and tanks, radio critical to choreograph it; the Spanish Civil War used to hone the concept and to inure Europe to the acceptance of bombing cities.

And in the Cold War, as discussed, missiles, computers and nuclear weapons were the tools of choice.

What now? Nuclear missiles are still the game-over weapons for humanity, but the non-nuclear weapons have changed, and so the tactics of war have changed too. And just as the Manhattan Project showed how easy it was to flatten a city, Olympic Games has shown how much damage you can do with laptops and a dedicated engineering team.

One of the screenshots in the documentary was of the North Korean dev team. It doesn't look like a dev team I'd recognise. It looks like the place where "breaking the build" carries severe punishment, rather than having to keep the "I broke the build!" poster(*) up in your cubicle until a successor inherited it. But it was an engineering team, and a lot less expensive than the same government's missile program. And it's something which can be used today, rather than held as a threat you dare not use.

What now? We have the weapons; perhaps a doctrine will emerge. What's likely is that you'll see multiple levels of attack.

The 2016 election; the Sony hack: passive attacks, data exfiltration followed by anonymous and selective release. We may as well assume such attacks are common; it's only in special cases that we get to see the outcome so tangibly.

Olympic Games and the rumoured BTC pipeline attack: destruction of targets, in peacetime, with deniability. These are deliberate attacks on the infrastructure of nations, executed without public announcement.

Nitro Zeus (undeployed): this is the one we all have to fear in scale, but do we have to fear its use? As the opening move to an invasion, it's the kind of thing that could be deployed against Estonia or other countries previously forced into the CCCP against their will. Kill all communications, shut down the cities, and within 24h Russian troops could be in there "to protect Russian speakers from the chaos". China could use it as a precursor to a forced reunification with Taiwan. Then there's North Korea. It's hard to see what a country that irrational would do -especially if they thought they could get away with it.

Us in the west?

Excluding Iraq, the smaller countries that Trump doesn't like -Cuba, North Korea- lack the infrastructure to destroy. The big target would be his new enemy, China -but hopefully the entirety of the new administration isn't that mad. So instead it becomes a deterrent against equivalent attacks from other nation states with suitable infrastructure.

What we can't do, though, is use it as a deterrent against Stuxnet-class attacks, not just on account of the destruction it would cause, but because it's so hard to attribute blame.

I suspect what is going to happen is something a bit like the evolution of the drone warfare doctrine under Obama: it'll become acceptable to deploy Stuxnet-class attacks against other countries, in peacetime. Trump would no doubt love the power, though his need to seek public adulation will hamper the execution. You can't deny your work when your president announces it on Twitter.

At the same time, I can imagine the lure of non-attributable damage to a competing nation state. Something that hurts and hinders them -but if they can't pin the blame, what's to lose? That I could see the Trump regime going for -and if it does happen to, say, China, and they work it out -well, it's going to escalate.

Because that has always been the problem with the whole tactical-to-strategic nuclear arsenal: once you've made the leap from conventional to nuclear weapons, it's going to escalate all the way.

Do we really think "cyber-weaponry" isn't going to go the same way? From deleting a few files, or shutting down a factory to disrupting transport, a power grid?

(*) the poster was a photo of the George Bush "mission accomplished" carrier landing, as I recall.

2017-01-28

TRIDENT-877 missile veered towards wrong continent; hemisphere

Apparently a test of a submarine-launched Trident missile went wrong: it started to head in the wrong direction and chose to abort its flight. The payload ended up in the Bahamas.

Aeronautics Museum

The whole concept of software engineering came out of a NATO conference in 1968.

The military were the first to hit this problem, because they were building the most complex systems: airplanes, ships, submarines, continent-wide radar systems. And of course: missiles.

Missiles whose aim in life is to travel from a potentially mobile launch location to a preplanned destination, via a suborbital ballistic trajectory. It's inevitably a really complex problem: you've got a multistage rocket designed to be moved around in a submarine for decades, designed to be launched without much preparation at a target a few thousand miles away. Which must make the navigation a fun little problem.

We can all use GPS to work out where we are, even spacecraft, which know to use the other solution to the GPS timing equation -the one which doesn't have a solution close to the geode, our model of the Earth's surface. Submarines can't use GPS while under water, and they, like their deliverables, can't rely on the GPS constellation existing at the time of use. Which leaves what? Gyroscopic compasses, and inertial navigation systems: mind-numbingly complex bits of sensor hardware trying to work out acceleration on different axes, using that, time, and knowledge of the starting point to work out where the missile is. Then there's a little computer nearby using that information to control the rocket engines.
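At its core, inertial navigation is dead reckoning by double integration: sum acceleration to get velocity, sum velocity to get position. A minimal one-dimensional sketch, with all numbers illustrative and none of the multi-axis gyroscope corrections a real INS needs:

```python
# Dead reckoning in one dimension: from a known starting position and
# velocity, integrate accelerometer samples over time to estimate the
# current position. A real INS does this on three axes, with gyroscopes
# tracking orientation; this shows only the core integration loop.

def dead_reckon(samples, dt, position=0.0, velocity=0.0):
    """samples: accelerometer readings in m/s^2, one every dt seconds."""
    for a in samples:
        velocity += a * dt          # integrate acceleration -> velocity
        position += velocity * dt   # integrate velocity -> position
    return position, velocity

# A constant 2 m/s^2 burn for 10 seconds of 10ms samples: velocity comes
# out at exactly 20 m/s, position at just over 100m (the discrete sum
# slightly overshoots the continuous integral).
pos, vel = dead_reckon([2.0] * 1000, dt=0.01)
```

It also shows why the sensors have to be so good: any accelerometer bias gets integrated twice, so position error grows with the square of flight time.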

Once above enough of the atmosphere to see stars in daylight, the missiles switch to astronomy. This turns out to be an interesting area of ongoing work -IR CCDs can position vehicles at sea level when it's not cloudy (tip: always choose your war zones in desert climates). While the Trident missiles are unlikely to have been updated, a full submarine refresh is bound to have installed the shiny new stuff. And in a qualification test with a real launch -that's something you'd want to try. Though of course you would compare any celestial position data with the GPS feed.

Yet somehow it failed. Apparently this was a "telemetry problem": the missile concluded that something had gone wrong and chose to crash into the sea instead. I'm really curious about the details now, though we'll never get specifics at a level that informative. First point: telemetry from the submarine to the missile? That is, something tracking the launch and providing (authenticated?) data to the missile, which it could compare with its own measures? Or was it the other way around: missile data to submarine? That would seem more likely -having the missile broadcast an encrypted stream of all its engine data and sensor input would be exactly what you want to identify launch-time problems. Perhaps it was some new submarine software which got confused, or got fed bad data somehow. If that was the case, and you could replicate the failure by feeding in the same telemetry, then yes, you could fix it and be confident that the specific failure was found and addressed. Except: you can't be confident that there weren't more problems hiding in that telemetry, or other things waiting to go wrong -problems which didn't show up because the missile had aborted.
Or it was in-missile: sensor data on the rockets misleading the navigation system. In which case: why use the term "telemetry"?

We aren't ever going to know the details, which is a pity as it would be interesting to know. It's going to be kept a secret though, not just for the sake of whoever we consider our enemies to be —but because it would scare us all.

I don't see that you can say the system is production ready if there was any software problem. One with wiring up, maybe, or some other hardware problem where a replacement board -a well qualified board- could be swapped in. Maybe even an operations issue which can be addressed with changes in the runbook. But software? No.

How do you show it works then? Well, testing is the obvious tactic, except, clearly, we can't afford to. Which is a good argument in favour of cruise missiles over ICBMs: they cost less to test.

Tomahawk Cruise missile

Governments just don't take the software engineering and implementation details of modern systems into account. Missiles are a special case, but things like the F-35 Joint Strike Fighter are another. Some of the software for that comes from BAe Systems a few miles away, and from what I gather, it's a tough project. The usual: over-ambitious goals and deadlines, conflicting customers, integration problems, suppliers blaming each other, etc, etc. Which is why the delivery and quality of the software is called out as a key source of delays, this in what is self-admittedly the world's largest defence programme.

It's not that the teams aren't competent —it's that the systems we are trying to build are beyond what we can currently do, despite that ~50+ years of Software Engineering.

2016-12-01

How long does FileSystem.exists() take against S3?

Ice on the downs

One thing I've been working on with my colleagues is improving performance of Hadoop, Hive and Spark against S3, one exists() or getFileStatus() call at a time.

Why? This is a log of a test run showing how long it takes to query S3 over a long haul link. This is midway through the test, so the HTTPS connection pool is up, DNS has already resolved the hostnames. So these should be warm links to S3 US-east. Yet it takes over a second just for one probe.
2016-12-01 15:47:10,359 - op_exists += 1  ->  6
2016-12-01 15:47:10,360 - op_get_file_status += 1  ->  20
2016-12-01 15:47:10,360 (S3AFileSystem.java:getFileStatus) -
  Getting path status for s3a://hwdev-stevel/numbers_rdd_tests
2016-12-01 15:47:10,360 - object_metadata_requests += 1 -> 39
2016-12-01 15:47:11,068 - object_metadata_requests += 1 -> 40
2016-12-01 15:47:11,241 - object_list_requests += 1 -> 21
2016-12-01 15:47:11,513 (S3AFileSystem.java:getFileStatus) -
  Found path as directory (with /)
The way we check for a path p in Hadoop's S3 Client(s) is
HEAD p
HEAD p/
LIST prefix=p, suffix=/, count=1
A simple file: one HEAD. A directory marker: two. A path with no marker but one or more children: three. In this run, it's an empty directory, so two of the probes are executed:
HEAD p => 708ms
HEAD p/ => 445ms
LIST prefix=p, suffix=/, count=1 => skipped
That's 1153ms from invocation of the exists() call to it returning true —long enough for you to see the log pause during the test run. Think about that: determining which operations to speed up not through some fancy profiler, but by watching when the log stutters. That's how dramatic the long-haul cost of object store operations is. It's also why a core piece of the S3Guard work is to offload that metadata storage to DynamoDB. I'm not doing that code, but I am doing the committer to go with it. To be ruthless, I'm not sure we can reliably do that O(1) rename, massively parallel rename being the only way to move blobs around, and the committer API as it stands precludes me from implementing a single-file-direct-commit committer. We can do the locking/leasing in DynamoDB though, along with the speedup.
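The probe sequence can be sketched as follows; `FakeStore`, `head` and `list_one` are invented stand-ins for the S3 calls, not the actual S3AFileSystem internals. What matters is the request count per case, because every probe is a full round trip:

```python
# Sketch of the HEAD/HEAD/LIST probe sequence used to resolve a path.
# Each probe is one round trip to the store; over a long-haul link each
# one can cost hundreds of milliseconds.

def get_file_status(store, path):
    """Return ("file" | "dir" | None, number of requests issued)."""
    requests = 1
    if store.head(path):               # HEAD p: is it a simple file?
        return "file", requests

    requests += 1
    if store.head(path + "/"):         # HEAD p/: an empty-dir marker?
        return "dir", requests

    requests += 1
    if store.list_one(path + "/"):     # LIST prefix=p/: any children?
        return "dir", requests

    return None, requests              # three probes to say "no such path"


class FakeStore:
    """Invented in-memory stand-in for an object store."""
    def __init__(self, objects):
        self.objects = set(objects)

    def head(self, key):
        return key in self.objects

    def list_one(self, prefix):
        return any(k.startswith(prefix) for k in self.objects)
```

A plain file costs one request, a directory marker two, an "implicit" directory three -and a missing path also three, which is exactly what a recursive treewalk hits over and over.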

What it should really highlight is that an assumption in a lot of code, "getFileStatus() is too quick to measure", doesn't hold once you move to object stores, especially remote ones, and that any form of recursive treewalk is potentially pathologically bad.
Remember that the next time you edit your code.

2016-11-22

Film Review: Arrival — Whorfian propaganda

Montepelier and beyond


Given the audience numbers for Arrival, in the first fortnight of its public release, more people will have encountered linguistic theory and been introduced to the Sapir-Whorf hypothesis than in the entire history of the study of linguistics (or indeed CS & AI courses, where I presume I first encountered it).

But it utterly dodges Chomsky's critique —that being the second irony: more people know Noam Chomsky(*) for his political opinions than for his contributions to linguistics and his seminal work on grammars; regexps being type 3, and HTML being very much not. While I'm happy to willingly suspend my disbelief about space aliens appearing from nowhere, the notion that S-W implies learning a new language changes the semantics of happens-before grated on me. I'd have really preferred an ending where the lead protagonists retreat and admit defeat to the government, wherein Chomsky does a cameo, "told you!", before turning to the person by his side and asking "More tea, Lamport?"

The whole premise of S-W, hence the film, is that language constrains your thinking: new languages enable new thoughts. That's very true of computing languages; you do think of solutions to problems in different ways once you fully grasp the tenets of a language like Lisp or Prolog. In human language: less clear. It certainly exposes you to a culture, and what that culture values (hint: there is no single word for Trainspotting in Italian, nor an English equivalent of passeggiata). And the S-W work was based on the different notions of time in Hopi, plus that "13 words for snow" story which implies the Inuit see snow differently from the rest of us. Bad news there: take up Scottish winter mountaineering and you not only end up with lots of words for snow (snow, hail, slush, hardpack, softpack, perennial snowfield, ET-met snow, MF-met snow, powder, rime, cornice, verglas, sastrugi, ...), you end up with more words for rain. Does knowing the word dreich make you appreciate it more? No, just that you have more of a scale of miserable.

Chomsky argued that language comprehension is hardwired into our brains, the frontal temporal lobe being the conventional location. Based on my own experiments, I'm confident that the location of my transient parser failures was separate from where numbers come from, so I'm kind of aligned with him here. After all: we haven't had a good conversation with a dolphin yet, and only once we can do that could we begin to make the case for what'd happen if we met other sentient life forms.


To summarise: while enjoying the lovely cinematography and abstract nature of the film, I sat there in disbelief about the language theme, wondering why they weren't asking the interesting questions, like the Halting Problem, whether P = NP, or, even more fundamental: does maths exist, or is it something we've just made up?

Maybe that'll be the sequel.


Further reading

[Alford80] Demise of the Whorf Hypothesis.

(*) This has made me realise I should add Chomsky to the list of CS grandees I should seek to be gently chided by, having now ticked Milner, Gray and Lamport off the list.

(picture: 3Dom on Moon Lane)

2016-11-09

Moving Abroad

Earlier this year I moved to a different country.

Whenever I think I've got accustomed to this country's differences, something happens. A minister proposes having companies publish lists of the numbers of non-British employees working for them. A newspaper denounces judges as Enemies of the People for having the audacity to rule that parliament must have a vote on government actions which remove rights from its citizens. And then you realise: it's only just begun.
Boris meets Trump at Westmoreland House
A large proportion of the American population have just moved to the same country. For which I can issue a guarded "hello". I would say "welcome", except the country we've all moved to doesn't welcome outsiders —it views them with suspicion, especially if they are seen as different in any way. Language, religion and skin tone are the usual markers of "difference", but not sharing the same fear and hatred of others highlights you as a threat.

Because we have all moved from an apparently civilised country to one where it turns out half the people are pitchfork-waving barbarians who are happy to burn their opponents. While we thought that humanity had put behind it the rallies for "the glorious leader" who blamed all their problems on the outsider —be it The Migrant, The Muslim, The Jew, The Mexican or some other folk demon— we hadn't; we'd just been waiting for glorious leaders who looked slightly better on colour TV.

Bristol Paintwork

One thing I've seen in the UK is that whenever something surfaces which shows how much of a trainwreck things will be (collapse in exchange rates, banks planning to move), the brexit advocates are unable to recognise or accept that they've made a mistake. Instead they blame: "the remainers"; the press "talking down the country"; the civil service "secretly working against brexit"; the judicial system (same); disloyal companies. Being pro-EU is becoming as much a crime as being from the EU.

That's not going to go away: it's only going to amplify as the consequences of brexit become apparent. Every time the Grand Plan stumbles, when bad news reaches the screens, someone will be needed to take the blame. And I know who it's going to be here in England —troublemakers like me.
We're sitting through a history book era. And not in a good way.

If there's one change from the past: forty years from now, PhD students studying the events, "the end of the consensus", "the burning of the elites", "the rise of the idiotocracies", or whatever it ends up being called, will be using Facebook post archives and a snapshot of the twitter firehose dataset to model society. That is: unless people have gone back and deleted their posts/tweets to avoid being recorded as Enemies of the State.

-Steve

ps: Happy Kristallnacht/Berliner Mauer Tag to all! If you are thinking of something to watch on television tonight, consider: The Lives of Others

2016-10-05

What shall we do with the Europeans in our midst?

Most of my family members are European. There: I said it. I have two German uncles. And a wife born in Nairobi, a son born in Oregon, a mother in Glasgow, a father in Ulster —a father who spent the last 20 years of his life living in France. I was born in Scotland, grew up in London, now living in Bristol. When I exercise my inherited right to an Irish passport, I shall officially remain an EU citizen, regardless of what happens in Britain.

We are all Europeans; a continent whose history of warfare is abysmal compared to the Chinese empire (before the UK started the Opium Wars), feudal Japan (before the US turned up and demanded access at gunpoint), North America (before the European colonists decided they wanted most of the land), and pretty much everywhere else. The reason Europe embraced guns, while China used gunpowder for fireworks, is that one place was a stable area which liked to party, the other a mess which liked to kill neighbours on account of: different religion. Different interpretation of the same religion. Speaking a different language. Differences in which individual was considered ruler of the area.

The post-1945 EU project was an attempt to address this, by removing the barriers, boosting trade and mutual industry, making visiting the other countries easy, and making it easy to live in the other countries. Why does the latter matter? It was aimed at preventing the mass unemployment scenarios of the 1930s from developing again —or at least spreading the pain, so one country didn't get trapped in a downward economic spiral: the people weren't going to be trapped, awaiting a Glorious Leader to rescue them. Instead they could follow the jobs.

Britain 2016 is not Germany 1934.
Berlin.

We haven't burned the books yet, though I wonder how long before the list of forbidden web sites becomes a mandatory feature of home broadband links, rather than an optional item.
Berlin

And we are a long, long way from the memorials to the people killed by an oppressive state.
Berlin Buzzwords

(though I note we don't own up to our history in Slavery; there's one slow-motion holocaust we pretend is nothing to do with us).

But: the hate is beginning, and I fear where it will lead.

Right now, if someone was asking for advice as to where in Europe to set up a small software startup, to hang around with like-minded hackers, to enjoy a lifestyle of which coding is an integral part, I'm going to say: not London. Not Bristol. Berlin.


P3130134

Britain has a party of hate in power. In four months we've gone from a referendum on being in or out of the EU project, to politicians proposing that companies provide lists of who is "foreign"; to the Prime Minister saying the National Health Service only needs staff and doctors from the rest of the continent "for an interim period".

All of a sudden we've gone from being one of the most diverse and welcoming countries in the continent to one where there's already an implicit message "don't buy a house here —you won't be staying that long."

Berlin

And what is the nominal opposition party doing? Are they standing up and condemning such atrocious nativist xenophobia? No, they are too busy with their internal bickering to look around, and when they do say things, it's almost going along with it: believing they need to appear harsh on immigration and accepting of the Brexit referendum —as if that is needed for power, and as if getting re-elected is more important than preventing what the Conservative party is trying to do. Where are the protests? The "we are all European" demonstrations? Because I'd be there. As it is, we have Nicola Sturgeon of the Scottish National Party as the sole mainstream politician to denounce this.

Berlin Buzzwords

Does anyone really think things would stop at collecting lists of "foreigners"? Or will that just legitimise the growing number of racist hate crimes which have started since the Brexit referendum; crimes that have gone as far as murder? It is only going to make things worse, and in the absence of a functional opposition, there is nothing to stop this.

I don't know how my friends and colleagues from the rest of Europe feel about this —I haven't spoken to any this morning. But I know this: I don't feel at home in this country any more.

2016-09-30

s/macOS Sierra/macOS Vista/

I've been using macOS Sierra for about ten or eleven days now, and I've rebooted my laptop about six times because the system was broken.

Two recurrent problems: failure to wake in the morning, and a gradual lockup of Finder with transitive app failure.

Failure to wake: I go up to the laptop, hit the keyboard and mouse, nothing happens. Only way to fix it: hold down the power button and wait for a hard restart.

Back in 1999 I worked on a project with HP's laptop group, where we instrumented a set of colleagues' laptops with a simple data collection app, then collected a few months of data. At the time this was considered "a lot of data". The result, the paper: The Secret Life of Notebooks. This showed that people tended to have a limited set of contexts, where context was defined as system setup (power, display, IP address) and application (mail, PPT). And people were so predictable in their use models that doing some hidden Markov modelling to predict contexts wouldn't have been hard.

I ended up writing some little app which essentially did that: based on IP address and which app (PPT, Acroread) was full screen, it could choose: power policy, network proxy options, sound settings (mute in meetings, etc). It was fairly popular amongst colleagues, because it would turn proxy settings on and off for you, knew to turn off display timeouts when giving presentations, and cranked up the power savings when on the move. When I look at Windows 8+'s adaptation to network settings, or OSX's equivalent and its "When on battery ..." options, I see the same ideas. You don't get any HMM on the laptops though; for that you have to pick up an Android phone and look at Google Now, something which really is trying to understand you and your habits. And, because it can read your emails, correlate those habits with emailed plans. If it really wanted to be innovative/scary it would work out who you were associated with (family, friends, colleagues, fellow students...) and use their actions to predict yours. Maybe someday.
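In the spirit of that app, a context switcher can be as simple as a list of (predicate, settings) rules matched against the observed state. Every name and setting below is invented for illustration, not anything from the original tool:

```python
# Toy rule-based context switcher: match the current observation
# (network, foreground app, fullscreen state) against known contexts
# and return that context's settings. First matching rule wins.

CONTEXTS = [
    # Presenting: kill display timeouts, mute the sound.
    (lambda s: s["app"] == "ppt" and s["fullscreen"],
     {"display_timeout": None, "sound": "mute"}),
    # On the office LAN: corporate proxy, mains-power behaviour.
    (lambda s: s["network"] == "office-lan",
     {"proxy": "proxy.corp:8080", "power": "performance"}),
    # No network at all: assume on the move, save power.
    (lambda s: s["network"] is None,
     {"proxy": None, "power": "max-savings"}),
]

DEFAULT = {"proxy": None, "power": "balanced"}

def settings_for(state):
    """Return the settings of the first context matching the state."""
    for matches, settings in CONTEXTS:
        if matches(state):
            return settings
    return DEFAULT
```

A hidden Markov model, as the paper hints, would go one step further: predicting the *next* context from the sequence of past ones rather than just reacting to the current observation.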

User-wise, another interesting feature was how differently people used mail online vs offline. Offline, you'd see this workflow of outlook-> word-> outlook-> ppt-> outlook-> acroread-> outlook, ... etc, very fast cycles. It seemed uncontrolled window tabbing at first —until you realise it's people going through attachments. Online, people's workflow pulled in IE (it was a long time ago), and you'd get a cycle where IE was the most popular app switched to from outlook. Email was already so full of links that the notion of reading email "offline" was dying. You could do it, but your workflow would be crippled. And that was 15+ years ago. Nowadays things would only be worse.

There was a second paper which was internal, plus circulated/presented to Microsoft. There I looked at system uptime, and especially the common sequence in the log

1998-08-23 18:15 event: hibernate
1998-08-24 09:00 event: boot

or

1998-09-01 11:20 event: suspend
1998-09-01 11:30 event: boot

That is: a lot of the time the laptop thought it was going to sleep, when it was really crashing.

My theory was that alongside the official ACPI sleep states S1-S5 there was actually a secret state S6, "the sleep you never awake from". Some more research showed that it was generally on startup that the process failed, and it was strongly correlated with external state changes: networks, power, monitors. It wasn't that the laptop made a mess of suspending, it was that when it came back up it couldn't cope with the changed state.

I don't know if macOS Sierra has that issue: I do know that it has that problem if left attached to an external display overnight. Looking in the system logs, you can see powernap wakeups regularly (that's with all displays off), but come the first user interaction event —when the displays are meant to kick on— they don't come up. This results in system logs not far off those from the '99 experiment:

2016-09-27 22:20 powerd: suspend
2016-09-27 23:00 powerd: powernap wake
  ....

2016-09-27 23:30 powerd: powernap wake
..

2016-09-28 00:30 powerd: powernap wake
..
2016-09-28 08:30 powerd: powernap wake
..
2016-09-28 09:30 powerd: boot


That last one: that's me trying to use it.

I've turned off powernap to see if that makes a difference there.

That's the nightly problem. What's happened three or more times is the lockup of Finder, with a gradual degradation of other applications as they go near its services.

First Finder goes, and restarts do nothing.
Finder not responding
Then the other apps fail, usually when you go near the filesystem dialogs, or the photo collection.
Safari not responding
As with Finder, restart does nothing.

If it was my own code, I'd assume a lock is being acquired in the kernel on some filesystem resource and never being released. This is why locks should always have leases. Root cause of that lock/release problem? Who knows. I can't help wondering, though, if it's related to the new iCloud sync features, as that's the biggest filesystem change. I've also noticed that I usually have a USB stick plugged in; I'm going to go without that to see if it helps.
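The lease idea is simply a lock whose ownership expires after a deadline, so a hung holder can't wedge everyone else forever. A toy sketch, nothing to do with the actual macOS internals; passing the clock in explicitly keeps the expiry testable:

```python
import time

class LeasedLock:
    """Toy lock whose ownership expires after `lease` seconds, so a
    holder that hangs (or dies) can't block everyone else forever."""

    def __init__(self, lease):
        self.lease = lease
        self.owner = None
        self.expires = 0.0

    def acquire(self, who, now=None):
        """Take the lock if it is free or the current lease has expired."""
        now = time.monotonic() if now is None else now
        if self.owner is None or now >= self.expires:
            self.owner = who
            self.expires = now + self.lease
            return True
        return False              # still validly held by someone else

    def release(self, who):
        """Release only if `who` still owns the lock."""
        if self.owner == who:
            self.owner = None
```

The trade-off is that a merely slow (not dead) holder can have its lease stolen, so real systems pair this with fencing tokens; but at least nothing stays wedged forever.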

When I get this slow failure, I don't rush to reboot. It takes about 10 minutes to get my dev environment back up and running again: the IDEs, the terminal windows, 2FA sign-ins to webapps, etc. I really don't want to have to do it. Instead I end up with bits of the UI keeling over, while I stick to the IDE, Chrome, and terminals. I had a bit of a problem on Thursday evening when Calendar locked up to the extent that I couldn't get the URLs for some conf calls; I had to use the phone to get the links and type them in.

Anyway, come the evening, after the conf calls and some S3a Scale tests, I kick off a shutdown.

And here a flaw of the OSX UI comes in: it assumes that whatever reason you have for shutting down, it is not because Finder has crashed. And it gives any application the right to veto the shutdown. You can't just select "Shut Down..." on the menu; you have to wait for apps to block it, stop them, and then continue. And even after doing all of that, I come in this morning and find the laptop, fans spinning away, me logged out, but with some dialog box about keychain access required. This is not shutting down; this is a half-hearted attempt at maybe shutting down sometimes, if your OS hasn't got into a mess.

It's notable that Windows has some hard-coded assumptions that a shutdown may be caused by the failure of something. It also has, from the world of Windows Server, the concept that the user may not be at the console waiting to click OK on dialogs popped up by apps. Thus it has a harsher workflow.

  1. A WM_QUERYENDSESSION message comes out saying "we'd like to shut down, is that OK?" Apps get the opportunity to veto the session end, but not if it's tagged as a critical shutdown. And if you don't service that event, you are considered dead and don't get a veto.
  2. The WM_ENDSESSION event is sent to apps to say "you really are going down —get over it".
  3. There is a registry entry WaitToKillAppTimeout you can use to control how long the OS waits for applications to terminate, WaitToKillServiceTimeout for services, and even HungAppTimeout to control how long an app has to respond to an exit menu request (WM_EXIT?) before being considered dead and so killed.
See? Microsoft know that things hang, that even services can hang, and that if you want to shut down then you want to shut down, not find out 12 hours later that it had stopped with a dialog box.

In contrast macOS Sierra has implicit assumptions that apps rarely hang, the OS services never deadlock, and that shutting down is a rare activity where you are happy to wait for all the applications to interact with you —even the ones that have stopped responding.

This may have held for OS X, but for macOS all those assumptions are invalid. And that makes shutdown far more painful and unreliable than it need be.

Now if you go low-level and do a "man shutdown", you can see that a similar escalation process is built in there:

Upon shutdown, all running processes are sent a SIGTERM followed by a SIGKILL.  The SIGKILL will follow the SIGTERM by an intentionally indeterminate period of time.  Programs are expected to take only enough time to flush all dirty data and exit.  Developers are encouraged to file a bug with the OS vendor, should they encounter an issue with this functionality.
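That escalation, ask politely with SIGTERM, then after a bounded grace period kill with SIGKILL, can be sketched as follows; the grace value is illustrative, and the real shutdown paths on both OSes are obviously more involved:

```python
import signal
import subprocess
import time

def terminate(proc, grace=2.0):
    """Ask a subprocess to exit with SIGTERM; if it is still running
    after `grace` seconds, kill it outright with SIGKILL. Returns
    "term" or "kill" depending on which signal did the job."""
    proc.send_signal(signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "term"             # exited politely
        time.sleep(0.05)
    proc.kill()                       # SIGKILL: no veto, no dialog box
    proc.wait()
    return "kill"
```

A cooperative process goes down on the SIGTERM; one that traps and ignores it gets the SIGKILL. The key property is the bounded wait: no process, however hung, can stall the shutdown indefinitely.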

I think from now on, it'll be a shutdown command from the console.

Anyway, because of all these problems, I do currently regret installing macOS sierra. It shipped to meet a deadline, rather than because it was ready.

macOS Sierra is not ready for use unless you are prepared to reboot every day, and are aware that the only way to reboot reliably is from the console.