This article is part of a series. In Part 1, I outlined the premise: ML/AI shops can borrow tips and best practices from how algorithmic trading (“algo trading”) shops operate. The rest of the articles explore those ideas in more detail.
Earlier in this series, I explored the importance of data infrastructure in the ML/AI world. Retrieval speed, data availability, and data dictionaries are key elements of a successful operation. A closely related concern is data access: what controls have you established around your data? Not just “who can pull which records” but also, “in general, how should information flow (or not flow) inside this company?”
This is another opportunity for the ML/AI world to borrow ideas from algo traders, as they are quite familiar with the notion of data controls.
Banks hold a lot of sensitive information, most of which is considered PII, so they develop control systems to limit the circumstances under which their staff can see it and how it can be used. Many large banks mix retail (consumer) banking, credit/lending (such as mortgages), investing (which includes trading), and merger/acquisition (M&A) activities under one roof, so they face some special challenges.
Let’s say the M&A office is handling a deal between two companies. The traders would certainly find this information useful, since any details would have a predictable impact on both companies’ share prices, and they could place trades accordingly. This would also be an unfair (and highly illegal) advantage, since the traders would be basing their decisions on information that was not public.
Large banks avoid this inappropriate information sharing by establishing formal barriers – sometimes known as Chinese Walls or Great Walls – between departments. (This is in addition to the standard controls that forbid people from sharing information with someone outside the bank.) The Chinese Wall prevents communication between M&A and the trading floor, which ensures that the traders hear about the merger only when it becomes public knowledge.
Even the trading operation is split, according to who is the ultimate customer. Some traders place orders at the behest of the bank’s clients, while the internal (proprietary, or “prop”) traders place orders on behalf of the bank. Even though both groups work under the same roof, they are walled off from each other such that their activity won’t mix. It is illegal for prop traders to use clients’ orders to influence their own.
While the trading floor is walled off from receiving certain information from inside the bank, other departments limit what they receive from the outside. The bank’s consumer lending group, for example, is limited in what personal details they can use for underwriting criteria (determining creditworthiness). Those rules are based on federal lending law – for example, your gender and marital status may not be used to deny you a loan – but it’s the bank’s responsibility to establish data collection and data handling controls in order to prove that this information did not factor into their decision.
Not all data silos are bad
A good deal of IT security involves handling the “insider threat” of employees misusing their access to see and steal sensitive information. Perhaps you’ve already addressed that in your company, by implementing access controls around who can see what data. Your customer service reps can only see select information about the person with whom they’re currently speaking, and every access is logged. Right? Right.
That’s a great first step, but the weak link in that chain is usually … your company’s appetite for data analysis. Even though your customer service reps can’t see every detailed record, your data team can. They have to, in order to develop and test their models.
Not only that, but you also make it easy for them to do so, since you pour all of your data into a single data lake precisely so they can combine different datasets. Working just with your sales data can yield certain insights, sure. Blending your sales data with historical weather data, though, can be a real eye-opener. As is blending the marketing data with the customer service data. Maybe you mine the “password questions” to drive personalization efforts. Or, you shuffle the mobile phone numbers for two-factor authentication over to a campaign to grow your user base. It’s tempting to try to blend data from every source and department in the quest for some useful, monetizable insight or action.
This is why we often hear that data silos are bad. And they usually are. A failure to combine datasets is a failed business opportunity. But when you take the time to consider the ramifications of wide-ranging data access – reflecting on the M&A example above – you’ll ask better questions around who should see what information, and when it’s appropriate for them to join data that was collected by different departments. (Especially when the people who provided that data didn’t know that you planned to use it for something else. For more details on that, check out my series on data ethics.)
One very subtle reason to be careful is that, lacking proper data provenance, it can be very hard to un-blend data. Suppose a customer says that they don’t want you to use their data for your ML/AI efforts. Or, say the laws are changing and you need to remove certain types of information from your training data. What would it take to comply?
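One way to make that question easier to answer is to carry provenance along with every blended row. The following is a minimal sketch, not a description of any particular tool: the helper names (tag, blend, drop_customer, drop_source) and the row layout are hypothetical, and a real pipeline would likely track this in its storage layer rather than in plain dictionaries.

```python
def tag(features, source, customer_id=None):
    """Wrap raw features with the source dataset and (optionally) the customer they describe."""
    return {
        "features": dict(features),
        "sources": {source},
        "customer_ids": {customer_id} if customer_id is not None else set(),
    }


def blend(*tagged_rows):
    """Combine tagged rows into one row, keeping the union of their provenance."""
    features, sources, customer_ids = {}, set(), set()
    for row in tagged_rows:
        features.update(row["features"])
        sources |= row["sources"]
        customer_ids |= row["customer_ids"]
    return {"features": features, "sources": sources, "customer_ids": customer_ids}


def drop_customer(rows, customer_id):
    """Remove every blended row that was derived from this customer's data."""
    return [r for r in rows if customer_id not in r["customer_ids"]]


def drop_source(rows, source):
    """Remove every blended row that was derived from this source dataset."""
    return [r for r in rows if source not in r["sources"]]
```

With tags like these, honoring an opt-out becomes a filter over the training data instead of an archaeology project. Without them, the only safe answer may be to rebuild the blended dataset from scratch.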
The data within the data
Internal barriers and data access controls protect you from mixing data in inappropriate ways. By comparison, enforcing privacy controls protects the people whom this data represents. Privacy is hardly a new issue, though it has become more visible in recent years as companies mine our increasingly digital activities.
Regulatory matters will determine how you can use data that is related to individual people. California’s CCPA and Europe’s GDPR are top of mind for a lot of companies these days, because they are fairly new. HIPAA and PCI DSS are also relevant regulations for companies that handle medical and payment card data, respectively.
Compliance covers what you must do to remain within the law. There’s still plenty of scope around what you should do. At a high level, consider the ramifications of how this data – whether in raw form, or distilled into a data product – could be misused. (I briefly explored the idea of checking your data supply chain in a post on data ethics.) A lot of potential misuse is rooted in identifying individuals, so a key element of data privacy involves using or sharing data in ways that make it harder to trace back to an individual.
One way to open those discussions is to ask about need: “Do we need to provide this data with identifiable information? Can we instead provide aggregations of data, to hide the individual(ly identifiable) records?” You can further obscure someone’s identity by adopting data diversity standards. For example, you can decide that any splittable subset of a data aggregate will be based on some minimum number of customers, not to exceed some maximum percentage of the overall dataset.
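As a rough illustration of such a standard, the sketch below reports a per-group average only when the group covers at least a minimum number of distinct customers and no more than a maximum share of all customers. This is one possible reading of the rule described above; the function name, the record layout (including the customer_id field), and the default thresholds are assumptions made for the example.

```python
from collections import defaultdict


def safe_aggregate(records, group_key, value_key, min_customers=5, max_share=0.5):
    """Average value_key per group, suppressing groups that are too small
    or that make up too large a share of the customer base."""
    by_group = defaultdict(lambda: defaultdict(list))  # group -> customer -> values
    all_customers = set()
    for rec in records:
        by_group[rec[group_key]][rec["customer_id"]].append(rec[value_key])
        all_customers.add(rec["customer_id"])

    result = {}
    for group, by_customer in by_group.items():
        n = len(by_customer)
        if n < min_customers:
            continue  # too few customers: individuals could be singled out
        if n / len(all_customers) > max_share:
            continue  # too dominant: the remainder could be inferred by subtraction
        values = [v for vals in by_customer.values() for v in vals]
        result[group] = {"customers": n, "mean": sum(values) / len(values)}
    return result
```

The thresholds are policy decisions, not technical ones. The code only makes them explicit and applies them the same way every time someone asks for a breakdown.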
Beware data walking out the door
In our example from earlier, the M&A team doesn’t talk to other groups inside the bank. They also don’t share anything with people outside of the bank, which is another step to keep sensitive information under wraps.
You may think you’ve already covered this, since your employees all sign a confidentiality agreement. But how often do you encourage those same employees to hand data to third parties? If you’re using SaaS tools to manage your payroll, customer service requests, or data labeling, these are all avenues for your data to leave your company’s four walls. If you haven’t checked the privacy policies of these companies, you also don’t know where the data will go from there.
Putting walls around and inside your data lake
Sharing data inside your company can lead to new and valuable insights. It can also open you up to larger problems. You’d do well to establish controls that determine who can see a given dataset (or even fields of a dataset), and define rules under which your data scientists may combine them.
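To make that concrete, here is a minimal sketch of a policy with two parts: which fields each role may read in each dataset, and which pairs of datasets may be joined at all. Everything in it (role names, dataset names, field names, and the two helper functions) is hypothetical. In practice you would more likely express these rules in your data platform’s own access-control features than in application code.

```python
# Hypothetical policy: role -> dataset -> fields that role may read.
FIELD_ACCESS = {
    "customer_service": {
        "support_tickets": {"ticket_id", "customer_id", "issue"},
    },
    "data_science": {
        "sales": {"order_id", "customer_id", "amount"},
        "support_tickets": {"ticket_id", "customer_id"},
        "security_questions": {"customer_id"},
    },
}

# Hypothetical policy: pairs of datasets that may be combined.
# Anything not listed here is walled off, even if a role can read both sides.
ALLOWED_JOINS = {
    frozenset({"sales", "support_tickets"}),
}


def visible_fields(role, dataset, record):
    """Return only the fields of a record that this role is allowed to see."""
    allowed = FIELD_ACCESS.get(role, {}).get(dataset, set())
    return {name: value for name, value in record.items() if name in allowed}


def may_join(role, left, right):
    """A join is permitted only if the role can read both datasets
    and the pair has been explicitly approved."""
    readable = FIELD_ACCESS.get(role, {})
    return (
        left in readable
        and right in readable
        and frozenset({left, right}) in ALLOWED_JOINS
    )
```

The join list is deny-by-default, which mirrors the bank’s approach: departments stay walled off from each other unless someone has decided that a particular flow of information is appropriate.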
Many thanks to Joshua Ulrich and other colleagues for reviewing this post and providing feedback. Any errors that remain are mine.