(Photo by Bradyn Trollip on Unsplash)
Unless you've worked in the financial space, the name "Knight Capital" likely won't ring a bell. This professional, high-tech market-maker was once the largest trader in US equities, with $21 billion in daily transaction volume.
I emphasize the word "was," because Knight Capital no longer exists. On this day in 2012, the company fell.
Unlike many newsworthy financial meltdowns, Knight's incident wasn't rooted in hubris or dodgy behaviors. Theirs is a tale of operational risk, complexity, and how everyday events can sometimes go awry.
The Knight Capital story serves as a warning to every company that employs software, AI, robotics, or other automation in a highly-connected environment.
It's easy to think of a large incident as a single event with a single cause. In a complex system, most such incidents occur because several smaller, fairly innocuous issues just happen to collide in an unfortunate manner.
For Knight Capital, those issues were:
How could these possibly have led to a catastrophic loss? Exchanges add new order types as needed; dev teams sometimes repurpose old code; and incomplete deployments happen now and then. Taken on their own, the outcomes range from "not an issue" to "mildly annoying." But in this case, the combination proved fatal:
Less than an hour after market open, the collective weight of those losing trades – $440 million – wiped out Knight Capital.
(I've vastly oversimplified for brevity. If you'd like the full story, I highly recommend the write-ups by Henrico Dolfing and Scott E.D. Skyrm, plus the official SEC document 3-15570.)
There are lots of take-away lessons from this story. One particularly subtle lesson stems from complex systems:
Small issues can combine to trigger large problems,
and those problems are not always obvious.
A complex system is a highly-connected network of smaller elements. The "complexity" stems not from the number of elements, per se, but from the idea that it's impossible to see all of the connections at once. Since you can't see the specific outcome of a component failure in advance, you only catch the daisy-chain of failures in real-time. By that point it's often too late.
Consider the three factors that led to Knight Capital's meltdown. No one could have looked at them and imagined this specific turn of events. Had any one of those factors been absent, I expect the company would still be in business today.
What's the solution, then? You can't eliminate the possibility of a complex systems failure unless you eliminate the complex system. (That's rarely an option -- you don't always have the choice to shift to a simple, smaller, non-connected world.) But you can establish defenses through operational risk practices: a mix of testing, monitoring, checking procedures, and proactively tackling smaller problems will help you dodge some failures and reduce the impact of others.
If we apply this thinking to the Knight situation:
If these three small issues can topple a mature, well-run financial juggernaut, they can topple any firm. And this isn't limited to the financial space. If your company builds software, AI, or robotics for automation, then it faces similar risks.
Take a moment to reflect on your own company. Where do you see minor, lingering code flaws? People skirting procedures, or procedures that are poorly understood? Misconfigured monitoring systems? These are easy to ignore because they aren't causing you trouble right now. That lack of immediate pain makes it tougher to imagine the specific incident that may manifest. That's fair.
That said, all you have to know is that something might go wrong, so you have to be proactive. Cleaning up the so-called "small" issues and establishing alert systems may improve your company's long-term survival.
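To make "cleaning up small issues and establishing alert systems" a little more concrete, here is a minimal sketch of one such check: flagging hosts that were missed by a deployment. Everything in it is hypothetical and for illustration only (the function name `find_stale_hosts`, the host names, the version strings). It is not a description of how Knight Capital or any other firm actually runs its systems.

```python
from typing import Optional


def find_stale_hosts(reported: dict[str, Optional[str]], expected: str) -> list[str]:
    """Return the hosts that are not running the expected version.

    `reported` maps a host name to the version that host says it is
    running, or None if the host did not answer at all. A silent host is
    treated as stale, because "we don't know" is not the same as "it's fine".
    """
    return sorted(host for host, version in reported.items() if version != expected)


# Hypothetical fleet: one host was skipped during the rollout, one never replied.
fleet = {
    "srv-1": "2.4.0",
    "srv-2": "2.4.0",
    "srv-3": "2.3.9",  # incomplete deployment: still on the old build
    "srv-4": None,     # no response to the version query
}

stale = find_stale_hosts(fleet, expected="2.4.0")
if stale:
    # In practice this would page someone or block the release;
    # printing keeps the sketch self-contained.
    print(f"ALERT: {len(stale)} host(s) not on 2.4.0: {', '.join(stale)}")
```

A check like this won't predict which chain of failures might unfold. What it can do is remove one link from the chain before it has a chance to combine with the others.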