What you see here is the last week’s worth of links and quips I have shared on LinkedIn, from Monday through Sunday.
For now I’ll post the notes as they appeared on LinkedIn, including hashtags and sentence fragments. Over time I might expand on these thoughts as they land here on my blog.
I’m sharing seven questions that will help you to build a robust AI training dataset and stay out of trouble. Today is Day 6:
**6/ How can we get more data?**
I’ve noted before that you never know how much data you’ll need until you start training the model. Even with a robust, realistic dataset, the model may require more data in order to uncover meaningful patterns.
How do you handle this, then? When my clients undertake a modeling exercise, I always tell them to pick a data source they can revisit in a hurry.
Your ideal situation is to work at a company that generates a lot of data every day (think: telcos, web hosts, social media platforms), because you’ll effectively have an unlimited supply. Need more training data? Just wait a day or two and it will come to you for free.
By comparison, be wary of one-shot sources of data. You may find yourself in a pickle if you need more of that data later but the proverbial well has run dry.
I’m sharing seven questions that will help you to build a robust training dataset for your ML/AI models. Today is the seventh and final post! Thanks for sticking with me.
Today’s question is:
**7/ Are we allowed to use this data?**
Let’s say you have an idea for a feature that might improve your model’s performance. That feature isn’t part of your proprietary dataset, and none of your upstream data vendors have it. What do you do?
Maybe you could work with a shady data broker. Or collect it yourself in a way that violates some other company’s terms of service (TOS), like scraping images from a social media site.
You could do those things. But do you really want to? As noted in “Our Favorite Questions,” attorney Shane Glynn reminds us:
The safest course of action is also the slowest and most expensive: obtain your training data as part of a collection strategy that includes efforts to obtain the correct representative sample under an explicit license for use as training data.
The next best approach is to use existing data collected under broad licensing rights that include use as training data even if that use was not the explicit purpose of the collection.
A training dataset that leads to legal action is not a training dataset that you want. At best, you’ll win the court case but still lose time, money, and effort along the way. At worst, you’ll be forced to remove the offending data from your model. And if your business relies on that model to generate revenue, you may have to close up shop.
Stick to data that you’re actually permitted to use, and you’ll avoid this trap.
If you’ve enjoyed this series, I’ve compiled the full list into a post on my website: “Building A Training Dataset Is Hard.”
My website also has a wealth of information on AI strategy, AI risk management, hiring, and more!
Thus far, Netflix and Disney have introduced ad-supported plans. Amazon Prime Video is next in line. Grocery stores are getting in on the ads space. Even Instacart is staking its claim.
And that makes me wonder: when it comes to business models, do all roads lead to ads?
As Instacart prepares to go public next week, it is a markedly different company. Envisioned in 2012 as a service that matched people at home with contract workers who would shop for them and deliver groceries, it has increasingly focused on advertising and software products as its delivery business has slowed.
This article is a couple of weeks old, but it’s still a great example of why Content Moderation Is Hard:
Between shifting news topics, shifts in language, and deliberate attempts to evade detection, content moderation is a moving target.
So … Meta plans to release a chatbot specifically for interacting with the younger crowd. And they want to give the bots some “personality,” including at least one that’s a bit sassy.
Meta faces an interesting risk/reward calculus here:
- The risk? That the bots go off the rails and say something out of line. We’ve already seen plenty of Chatbot Gone Wrong™ from other players in the space, so we know that kind of malfunction is well within the realm of possibility.
- The reward? If this works, maybe Meta can win (or, “win back”) some user-time from TikTok.
Meta will need a lot of skill, focus, and luck for this to work out in their favor. I hope they publish what they learn about keeping the bots in line, along with any other lessons on end-user safety (which, as we all know, has a direct connection to brand safety). The world could use some chatbot guidelines, and a company as big as Meta will get a lot of practice figuring out what works.
Among the bots in the works is one called “Bob the robot,” a self-described sassmaster general with “superior intellect, sharp wit, and biting sarcasm,” according to internal company documents viewed by The Wall Street Journal.
A brief biography of Hank Asher. He didn’t invent the concept of Collecting And Monetizing Personal Data, but he sure as hell made his mark on the field.
“The Man Who Trapped Us in Databases” (NY Times)
Stock photo service Getty is creating its own generative AI bot:
Of special note is where they got their training data. Unlike some of the other generative AI bots out there, Getty secured rights to those photos before building their models:
On the licensing end, Peters insists that Getty’s AI photo generator is different from other AI image tools because Getty has cleared the legal rights to the photos that are being used to train the models. “It’s commercially clean, and that gives us the ability to stand behind it 100 percent,” he says. “So if you want to use generative AI to be more creative and explore different boundaries, Getty Images is the only offering out there that is fully indemnified.”
- This is an interesting data product for Getty.
- It’s also a way to dramatically reduce risk for their clients (since it’s highly unlikely that anyone will try to sue Getty for copyright infringement).
- And who knows? This may pave the way for other companies to release their own “clean” or “lawsuit-proof” datasets and AI services.
My full take: “The Getty generative AI bot and lawsuit-free datasets”