Unless you’ve worked in the financial space, the name “Knight Capital” likely won’t ring a bell. This professional, high-tech market-maker was once the largest trader in US equities, with $21 billion in daily transaction volume.
I emphasize the word “was,” because Knight Capital no longer exists. On this day in 2012, the company fell.
Unlike many newsworthy financial meltdowns, Knight’s incident wasn’t rooted in hubris or dodgy behaviors. Theirs is a tale of operational risk, complexity, and how everyday events can sometimes go awry.
The Knight Capital story serves as a warning to every company that employs software, AI, robotics, or other automation in a highly-connected environment.
It’s easy to think of a large incident as a single event with a single cause. In a complex system, most such incidents occur because several smaller, fairly innocuous issues just happen to collide in an unfortunate manner.
For Knight Capital, those issues were:
- The rollout of a new order type on the New York Stock Exchange (NYSE).
- Repurposing some old, lingering code to support that new order type.
- An incomplete deployment of that new code to its servers.
How could these possibly have led to a catastrophic loss? Exchanges add new order types as needed; dev teams sometimes repurpose old code; and incomplete deployments happen now and then. Taken on their own, the outcomes range from “not an issue” to “mildly annoying.” But in this case, the combination proved fatal:
- The code rollout missed one of Knight’s servers.
- The developers had repurposed a section of old code for the new order type. Therefore, the server that didn’t get the update ran the old code when it saw the new order type.
- It just so happened that this old code was a test program designed for very aggressive trading.
- When Knight’s crew saw that the trading system wasn’t behaving as expected, they did what every responsible, reasonable tech team would do: they figured the new code was the problem, so they rolled back to the old code.
- Which would have been great, except that …
- … now the old (flawed) code was running on all of the servers, losing on every trade.
Less than an hour after market open, the collective weight of those losing trades – $440 million – wiped out Knight Capital.
There are lots of takeaway lessons from this story. One particularly subtle lesson concerns complex systems:
Small issues can combine to trigger large problems,
and those problems are not always obvious.
A complex system is a highly-connected network of smaller elements. The “complexity” stems not from the number of elements, per se, but from the idea that it’s impossible to see all of the connections at once. Since you can’t see the specific outcome of a component failure in advance, you only catch the daisy-chain of failures in real-time. By that point it’s often too late.
Consider the three factors that led to Knight Capital’s meltdown. No one could have looked at them and imagined this specific turn of events. Had any one of those factors been absent, I expect the company would still be in business today.
What’s the solution, then? You can’t eliminate the possibility of a complex systems failure unless you eliminate the complex system. (That’s rarely an option – you don’t always have the choice to shift to a simple, smaller, non-connected world.) But you can establish defenses through operational risk practices: a mix of testing, monitoring, checking procedures, and proactively tackling smaller problems will help you dodge some failures and reduce the impact of others.
If we apply this thinking to the Knight situation:
- Testing: NYSE would have required all participants to test their code to make sure it would properly handle the new order type. We can assume Knight performed those tests, and it most likely ran its own internal software testing as well. This line of defense worked as expected.
- Checking procedures: Incomplete code deployments are not an everyday occurrence, but they do happen. I imagine a world in which Knight’s team had followed up on the deployment to make sure that the code had gone to all of their servers.
- Monitoring: Knight’s team could tell that something was wrong, but they weren’t able to pin it to a particular server misbehaving. With fine-grained, per-server monitoring in place, they might have caught it. “Why is the trade volume so high on Server #4?”
- Proactively tackling smaller problems: From what I’ve read, the aggressive test program was no longer in use. Had Knight’s developers proactively removed this test program from the codebase, it would not have been present the day of the rollout and therefore it would not have triggered.
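The “checking procedures” idea above can be sketched in a few lines: before enabling the new code path, verify that every server in the fleet reports the expected version, and halt the go-live if any server missed the rollout. This is a minimal illustration, not Knight’s actual tooling; the server names, version strings, and the shape of the version report are all hypothetical.

```python
# Hypothetical post-deployment check: compare each server's reported code
# version against the version we intended to roll out. Any mismatch blocks
# go-live. Names and versions below are illustrative only.

def find_stale_servers(versions: dict[str, str], expected: str) -> list[str]:
    """Return the servers whose deployed version differs from `expected`."""
    return [server for server, version in versions.items() if version != expected]

# Simulated report from a fleet of eight servers; one missed the rollout.
reported = {f"server-{n}": "v2.0" for n in range(1, 9)}
reported["server-4"] = "v1.3"   # the server the deployment skipped

stale = find_stale_servers(reported, expected="v2.0")
if stale:
    print(f"HALT: servers not updated: {stale}")  # block go-live until fixed
```

A check like this costs minutes to write and run; in Knight’s case, catching the one stale server before market open would have broken the chain of events entirely.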
If these three small issues can topple a mature, well-run financial juggernaut, they can topple any firm. And this isn’t limited to the financial space. If your company builds software, AI, or robotics for automation, then it faces similar risks.
Take a moment to reflect on your own company. Where do you see minor, lingering code flaws? People skirting procedures, or procedures that are poorly understood? Misconfigured monitoring systems? These are easy to ignore because they aren’t causing you trouble right now. That lack of immediate pain makes it tougher to imagine the specific incident that may manifest. That’s fair.
That said, you don’t need to predict the specific failure; you only need to know that something might go wrong, and be proactive about it. Cleaning up the so-called “small” issues and establishing alert systems may improve your company’s long-term survival.