This post is part of a series on approaching data ethics through the lens of risk: exposure, assessment, consequences, and mitigation.
- Part 1 – exploring the core ideas of risk, and how they apply to data ethics
- Part 2 and Part 3 – questions to kick off a data ethics risk assessment related to company-wide matters
- Part 4 – questions to assess individual data projects
- Part 5 – risk mitigation and otherwise staying out of trouble
Whenever your company handles data – from selling it in raw form, all the way to providing data-driven services – you expose yourself to some amount of risk, especially data ethics risk. Performing a risk assessment is key to uncovering and handling problems early on.
It’s best to perform an assessment on a data project during the planning stages, so you can make proactive adjustments. That said, you can still review existing projects that you have already released into the wild, and catch problems that haven’t yet surfaced.
In Part 3 I mentioned the idea of your data supply chain: keeping track of how data enters and leaves your organization.
The exit point involves selling raw or summarized data to some outside party. How well do you know the downstream recipient? How will they use the data? You might claim that it’s not your problem what your customers do once you’ve sold them the goods. If they do something nasty with the data, though, your name will get dragged down with theirs. It’s best to think twice before writing this off as a non-issue.
Take the analytics firm Slice Intelligence, for example. They created a service, Unroll.me, to help people manage their e-mail subscriptions. They also used Unroll.me to collect Lyft receipt details, which they then sold to competing ride-share firm Uber. People can debate whether this was an appropriate competitive intelligence activity on Uber’s part (especially as Uber has made it into the news for other unsavory data practices) but the fact remains that Slice Intelligence found themselves sharing Uber’s unwanted press spotlight.
Before you sell raw data to someone, ask yourself how you’d feel about making front-page news as a result.
You certainly wouldn’t build a project that makes you uncomfortable, but there are a lot of reasons other people may feel ill at ease with something that you think is perfectly innocent. People don’t react well when they feel surprised or exposed through a company’s data efforts.
As a concrete example: In cafes, you typically provide the cashier with a name to associate with the order. This can be slow, messy, and error-prone, especially during rush periods, so some cafes have switched to point-of-sale systems that instead pull the name from the credit card that is used for the purchase.
The intended outcome here – quickly getting the right name on an order – includes an unintended outcome. There are plenty of people who provide pseudonyms in such a situation because they aren’t comfortable with having their name broadcast in a public space. This includes people who have adopted a nickname to blend in with a local culture, as well as people who use a pseudonym for reasons of personal safety.
This is why your risk assessment should include feedback from a variety of perspectives in order to spot these kinds of problems early on. You certainly want diversity of experience in terms of gender and ethnicity, but also sexual identity, age range, income, and even geographic location.
For every data project, ask yourself: is this idea worth alienating groups of people who would be uncomfortable? Is it worth putting someone else at risk?
Testing a data project for discomfort is one thing. You’ll also want to test for intentional misuse: “if we sell this raw data, or sell services based on this data, how could it be used for nefarious purposes?”
Here, you can borrow the idea of red-teaming from the military world. A “red team” is a group of people in your company who know the project and who pretend to have malicious intent. They come up with ways to use the data or service to harm other people or other companies, and a “blue team” then figures out ways to counter those efforts.
A red-team exercise lets you uncover problems before a person with real malicious intent does it for you in the wild. For example, people have pointed out that remote-controlled “smart locks” and other “smart home” devices can serve as tools of unwanted surveillance and domestic abuse. And let’s not forget the time run-tracker app Strava disclosed locations of military bases because it put users’ running routes on public maps. Most recently, YouTube has been in the spotlight because its recommendation systems – which have already been accused of sending people down rabbit-holes of increasingly extreme political content
In all of the examples above, it sounds like some red-teaming could have saved these companies a lot of unintended consequences and unwanted press.
(The Strava case is also an example of “anonymized data isn’t always secret enough,” but we’ll cover that some other time.)
In some cases you’ll scrap a data project altogether as the result of a red-team exercise. It’s more likely that you’ll modify your plans to close off avenues of misuse while maintaining most of the intended outcomes.
A lot of datasets are biased in some form. Biased data differs from “messy” data in that it is clean, but also incomplete: it doesn’t adequately represent the full audience of your intended analysis or prediction efforts.
Having a biased dataset isn’t a problem on its own; the problem is when you fail to recognize, acknowledge, and then adjust for that bias. For example: if you have built a dataset on the banking habits of people with top-tier incomes, then that data is probably not suitable to drive a machine learning system for the banking needs of the general population.
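One simple way to start recognizing this kind of bias is to compare each group's share of your dataset against its share of the audience you intend to serve. The sketch below is only an illustration: the function name, the `income_tier` field, and all of the numbers are hypothetical, and a real check would use your own segments and a trustworthy reference for the target population.

```python
from collections import Counter


def representation_gaps(records, group_key, reference_shares):
    """Compare each group's share of the dataset with its share of the
    population you intend to serve.

    A positive gap means the group is over-represented in the dataset;
    a negative gap means it is under-represented.
    """
    counts = Counter(record[group_key] for record in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected_share in reference_shares.items():
        observed_share = counts.get(group, 0) / total if total else 0.0
        gaps[group] = observed_share - expected_share
    return gaps


# Hypothetical dataset: 8 of 10 customers sit in the top income tier.
customers = [{"income_tier": "top"}] * 8 + [{"income_tier": "middle"}] * 2

# Hypothetical shares for the general population (made-up numbers).
population = {"top": 0.1, "middle": 0.5, "lower": 0.4}

gaps = representation_gaps(customers, "income_tier", population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%}")
```

With these made-up numbers, the top tier is heavily over-represented and the lower tier is missing entirely. A gap like that is a cue to acknowledge the bias and either adjust for it or narrow the project's intended audience.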
Worse still is when a biased dataset leads to harm. Predictive policing systems – particularly those built on facial recognition – are especially dangerous when built on biased data. They can cause law enforcement to unfairly target certain areas or populations, leading to false positives (from over-scrutinizing some groups) and false negatives (from under-scrutinizing other groups).
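If you have labeled outcomes to test against, one way to look for this pattern is to compute false positive and false negative rates separately for each group, rather than a single overall accuracy figure. This is a minimal sketch under stated assumptions: the function name, the group labels "A" and "B", and the sample rows are all hypothetical.

```python
from collections import defaultdict


def error_rates_by_group(rows):
    """rows: iterable of (group, actual, predicted) with boolean labels.

    Returns {group: (false_positive_rate, false_negative_rate)}.
    A rate is None when the group has no examples of that kind.
    """
    tally = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, actual, predicted in rows:
        t = tally[group]
        if actual:
            t["pos"] += 1
            t["fn"] += int(not predicted)  # missed a real positive
        else:
            t["neg"] += 1
            t["fp"] += int(predicted)      # flagged a real negative
    rates = {}
    for group, t in tally.items():
        fpr = t["fp"] / t["neg"] if t["neg"] else None
        fnr = t["fn"] / t["pos"] if t["pos"] else None
        rates[group] = (fpr, fnr)
    return rates


# Hypothetical evaluation data: (group, actual, predicted)
rows = [
    ("A", False, True), ("A", False, True), ("A", False, False), ("A", True, True),
    ("B", False, False), ("B", False, False), ("B", True, False), ("B", True, True),
]

rates = error_rates_by_group(rows)
for group, (fpr, fnr) in sorted(rates.items()):
    print(group, "false positive rate:", fpr, "false negative rate:", fnr)
```

In this toy data, group A is wrongly flagged far more often than group B, while group B has more missed cases. That gap between groups is the thing to look for; similar overall accuracy can hide it.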
If you don’t uncover the biases in your data before you release a project, other people certainly will. Wouldn’t you rather be the first to find the problem?
If your data project – or even your entire business model – sits right at the edge of existing laws, that means you face a sizable risk: a single regulatory shift can limit or even close your business overnight. (This goes double if your upstream data providers operate at that same edge, as their shutdown could impact your business.)
Plenty of companies intentionally establish themselves within legal grey areas, and others get very close to the limits of the law. Their data ethics risk assessments should note regulatory matters that could cause an upset. Such problems could come in the form of brand-new rules, or even clarification of an existing rule.
The largest such example in recent memory is the European General Data Protection Regulation, or GDPR. In brief, GDPR requires companies to provide more detail on how they collect and use personal data, and to be more up-front in getting consent for that data collection. A lot of businesses, especially those in the online advertising and marketing space, had to make changes in order to comply with GDPR.
GDPR was hard to miss, since it made international news for several months leading up to its 2018 implementation date. Other regulatory changes may not get that kind of coverage, so it’s up to companies to do their homework on rules affecting their industry.
This concludes our list of questions to kick off a data ethics assessment. Next, in the final article in this series, I’ll wrap up with some thoughts on the other side of risk: mitigation.
(This post is based on materials for my workshop, Data Ethics for Leaders: A Risk Approach. Please contact me to deliver this workshop in your company.)