If your company collects any kind of data – and that's just about every company these days, let's be honest – then you'll eventually need to develop a data retention policy.
You may have already done this because of industry regulations. Even if you're not under that kind of constraint, it's helpful to document what data you collect and how you handle it. This will bring key company figures to the same understanding. As a bonus: in the event the laws change for your field, you'll already be on the road to compliance.
You'll want to work closely with your legal counsel to develop a data retention policy. Here I'll walk through several ideas to keep in mind as you do so:
Defining a data retention policy involves mapping out what data you collect, where you store it, how long you keep it, and how you delete it.
Let's review these in order:
The data you collect will depend on your company's data/AI strategy, as well as your business model and operations. That usually amounts to:
That last point merits special attention. Some data is within your reach but not worth the trouble of storing. A prime example is sensitive data that you're unable to protect. (Consider the online merchants who use a third-party shopping cart system, and who make it very clear that they never see your credit card number.) Then there's the data you'd feel uncomfortable turning over to a regulator or law enforcement if compelled to do so. (Hence why some VPN providers claim that they don't keep logs.) And then there's the purely practical matter of data that's too expensive to store. (Say, massive amounts of web traffic logs that you'd be unable to retrieve quickly enough for any kind of project work.)
The general idea here is to stick to data for which you have a clearly defined purpose. You want to avoid being a packrat, collecting data just in case it becomes useful someday.
Where you store it. In the simplest scenario, you'd establish a local or cloud-based data storage system and be done with it. But many services these days are accessible worldwide, so your company and your customers may exist in different jurisdictions. That will shape what data you can collect and in which country you can store it.
Note that "storage" might include derived data products such as models or published analyses. You might need to be able to trace a data record through to these systems in order to remove it, or to prove that it never made it into that derived artifact in the first place.
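To make the tracing idea concrete, here's a minimal Python sketch of a lineage index. It's a toy under stated assumptions, not a recommendation for any particular tool: the LineageLog class, the artifact names, and the record IDs are all hypothetical. The point is simply that if every derived artifact notes which source records it consumed, you can later answer "where did this record end up?" or show that it never made it into a given artifact.

```python
from collections import defaultdict


class LineageLog:
    """Toy lineage index: which derived artifacts consumed which source records."""

    def __init__(self):
        # Maps a source record ID to the set of artifacts built from it.
        self._artifacts_by_record = defaultdict(set)

    def record_use(self, artifact, record_ids):
        """Note that `artifact` (a model, a report, ...) was built from these records."""
        for record_id in record_ids:
            self._artifacts_by_record[record_id].add(artifact)

    def artifacts_containing(self, record_id):
        """Return the artifacts that used this record; empty if it was never used."""
        return sorted(self._artifacts_by_record.get(record_id, set()))


# Hypothetical artifacts and record IDs, for illustration only.
lineage = LineageLog()
lineage.record_use("churn_model_v3", ["cust-001", "cust-002"])
lineage.record_use("q2_sales_report", ["cust-002", "cust-003"])

print(lineage.artifacts_containing("cust-002"))  # ['churn_model_v3', 'q2_sales_report']
print(lineage.artifacts_containing("cust-999"))  # []
```

A real system would persist this index alongside the artifacts themselves, but even this small shape shows what you'd need to capture at build time.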
How long you keep it. As a baseline, consider data retention laws. If the local government requires you to hold on to transactional data for some number N of years, then you need to find a way to store that data for N years. Plain and simple.
From there, you'll need to sort out how long a given data record is useful to the business. At what point is it no longer valid for reports, analyses, or models? When do you roll up fine-grained, individual records into aggregate figures for historical research?
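Here's one way to sketch the interplay between the legal baseline and business usefulness in Python. Everything specific in it is an assumption for illustration: the RetentionRule type, the category names, and especially the year counts, which are made up and not legal guidance. It also assumes the simplest possible rule, that you keep a record for whichever period is longer. Your jurisdiction may impose maximum retention periods as well, which this sketch ignores.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionRule:
    legal_minimum_years: int    # the "N years" a regulator requires (hypothetical values below)
    business_useful_years: int  # how long the business finds the record useful


def add_years(start, years):
    """Add whole years to a date, mapping Feb 29 to Feb 28 in non-leap target years."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_deletion_date(created, rule):
    """Assumed rule: keep the record for the longer of the two periods."""
    keep_years = max(rule.legal_minimum_years, rule.business_useful_years)
    return add_years(created, keep_years)


# Made-up numbers; check the real ones with your legal counsel.
RULES = {
    "transactions": RetentionRule(legal_minimum_years=7, business_useful_years=3),
    "web_logs": RetentionRule(legal_minimum_years=0, business_useful_years=1),
}

print(earliest_deletion_date(date(2024, 2, 29), RULES["transactions"]))  # 2031-02-28
print(earliest_deletion_date(date(2024, 6, 1), RULES["web_logs"]))       # 2025-06-01
```

Writing the rules down as data, one entry per category, also gives the various departments something specific to review and argue over.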
Then there's the practical question of how much data you can afford to store. "Disk is cheap" is a common refrain, and there's some truth to that. But remember that "storage" involves more than just disk space. You need to design future-proof storage layouts so you can actually find the data later. That will play into any manual or automated processes that pack up the data for archiving, and those processes must account for both internal and external security concerns… And so on.
All of this influences staffing decisions, as well, because you will need people of a certain skill level to handle these matters.
Lastly, you'll need to determine what triggers early or manual deletion. Should a legal matter compel you to remove certain records, you must have some way to find the data in question and remove it. And that takes us to the last point:
How you delete it. This will vary based on the type of data, so I can't provide much general guidance here. I can, however, suggest that you consider how long the data lives in your systems after someone hits that delete key. Even if it's removed from the main datastore, will it also disappear from any secondary or hot-spare data storage systems? How long until it expires out of your backups, and out of the refresh cycle of derived data artifacts such as reports and models?
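One way to reason about that lag is to list every place a record can linger and take the slowest one. The Python sketch below does exactly that. The system names and day counts in LINGER_DAYS are hypothetical placeholders, and the model assumes each system purges on its own fixed schedule, which your real systems may not.

```python
from datetime import date, timedelta

# Hypothetical: worst-case days a deleted record lingers in each system.
LINGER_DAYS = {
    "primary_datastore": 0,
    "hot_spare_replica": 1,
    "nightly_backups": 35,
    "monthly_report_refresh": 31,
    "model_retraining_cycle": 90,
}


def fully_purged_on(deleted_on, linger_days):
    """Date by which the record is gone everywhere, given the slowest system."""
    slowest = max(linger_days.values(), default=0)
    return deleted_on + timedelta(days=slowest)


def slowest_systems(linger_days):
    """Names of the systems that set the overall purge date."""
    slowest = max(linger_days.values(), default=0)
    return sorted(name for name, days in linger_days.items() if days == slowest)


print(fully_purged_on(date(2025, 1, 1), LINGER_DAYS))  # 2025-04-01
print(slowest_systems(LINGER_DAYS))                    # ['model_retraining_cycle']
```

In this made-up example the model retraining cycle, not the backups, decides when the data is truly gone. That's the kind of surprise you'd want to surface before the policy promises a deletion window.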
Once you've sorted out what data you'll keep and for how long, you're ready to develop your policy. Don't let the short length of this section fool you. Getting through the specifics of your data retention policy will likely take multiple rounds of review, and involve several departments.
To start, you'll want to involve key figures. If someone's job touches data or policy in any way, that person (or their department head) must weigh in. That includes heads of IT, application development (software engineering), product, data, HR, finance, legal … pretty much every department in the company. This may sound like a lot of people. And it is! This is not the sort of document where you want to risk any gaps or hand-waving later on.
The plus side is that, despite the number of people involved, it's mostly a matter of everyone making note of:
and then noting any conflicts.
Once you've sorted out the conflicts, your legal team will want one last pass on the document. Why so? The clue is in the name: "policy." Your attorneys will have to interpret and defend this policy should someone outside your company question it. As such, that team gets the final sign-off on all policy matters.
Zooming out, developing a data retention policy is mostly a matter of documenting your business needs, opportunities, and constraints.
If you see your company's data collection and retention through this lens, you'll be less likely to collect every bit of data "just in case," and you'll be prepared to answer the question: "What data do you collect, and how long do you keep it?"