The Getty generative AI bot and lawsuit-free datasets

Posted by Q McCallum on 2023-09-27

Stock photo service Getty has unveiled a generative AI tool. Before you say, “so has everyone else,” it’s worth noting what sets theirs apart: the underlying models were trained on Getty’s massive catalog of images.

(Credit where it’s due: I originally found this story in Les Echos.)

For one, this is a great idea. It’s a way for Getty to make a name for itself in the generative AI space and at the same time monetize its proprietary dataset.

Two, allow me to emphasize that phrase, “proprietary dataset.” Unlike some of the bigger-name generative AI services, which are facing lawsuits over alleged copyright violations, Getty had already secured the rights to the images in its training data.

Lawsuits, especially those in uncharted waters, introduce unwanted uncertainty into business planning. (Hence my catchphrase: you never want to be a case law pioneer.) “Will the affected service continue unchanged? Will it cease to exist?” You won’t know until the lawsuits are resolved. This should be top of mind for anyone building on OpenAI or similar providers.

All of that leads to something that’s been on my mind for a while now:

Three, welcome to the cottage industry of known-safe, lawsuit-free training datasets. If you plan to use or build on someone’s AI-based service, you’ll want some assurance that they have acquired the necessary licenses to use the underlying training data. Ditto for an upstream provider that’s selling you the raw data.

Will these services or datasets be as good as the ChatGPTs or Stable Diffusions of the world? Depends on how you define “good.” Smaller companies will damn the legal risks and go with whatever’s cheapest or most popular. (If they’ve made the foolish move of building their entire business model around a contested service or dataset, though, they might later reconsider.) But a larger, established company won’t want a service provider that’s mired in pending legal action.

Four, that noise you hear? It’s the sound of companies scouring their attics for proprietary datasets they can package up and sell. If that describes your situation, consider these five questions:

1/ Why didn’t you do this sooner? The opportunity to create data products predates generative AI by a wide margin. Some of the bigger data brokers have been around since the 1990s.

2/ Have you truly secured legal rights to this data? Or do you only think you have? If you plan to label this as a safe, lawsuit-free dataset, you’ll need to back that claim. A little time with your legal department today can pay dividends down the road.

3/ Even if you have legal rights, what about moral and ethical issues? Let’s say your generative AI bot leaks a customer’s personal information. Your app’s terms of service (TOS) may cover you in a court of law. But the court of public opinion runs on different rules.

4/ How, specifically, will you package this data into a product? Will you sell reports? Offer up an AI model behind an API? Release some other distilled version of the data? Take the time to be strategic as you define and market your offering. It’s easy to just release the raw data, yes. But doing so gives someone else the opportunity to make money on refined products.

5/ Are you planning for defense? It’s not just competition you have to worry about. Data that you release – either in raw form or packaged as a product like a report or AI model – is open to misuse. Warding off problems requires a degree of preparation and adversarial thinking, and it’s a moving target.