What you see here is the past week’s worth of links and quips I’ve shared on LinkedIn, from Monday through Sunday.
For now I’ll post the notes as they appeared on LinkedIn, including hashtags and sentence fragments. Over time I might expand on these thoughts as they land here on my blog.
Creating and maintaining a clean, highly-curated dataset isn’t just a good practice for your company’s data analysis work. It’s also a form of risk management:
Bad data has always been the scourge of data projects. And it always will be. Bad data leads to failed projections and misbehaving ML models, which then cascade into business problems and lost revenue.
And with generative AI, those model problems can now become very public embarrassments. A chatbot that’s seen enough bad training data may “say” all kinds of inappropriate things while wearing your company logo, acting on behalf of your brand.
So what do you do? How do you limit the risk of bad data derailing your AI projects?
1/ When possible, collect the training data yourself. This way you’ll never have to question how it came to be.
2/ If you must acquire data from a vendor, make sure the vendor is a reputable source. You don’t want data that will land you in hot water – and the same goes for data that is simply low quality.
3/ Check the data to make sure it’s fit for purpose. Even if your data vendor is above-board and has sold you a high-quality dataset, it may still not be suitable for the project at hand.
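That third step – checking that data is fit for purpose – can start with very simple audits. Here’s a minimal sketch of what that might look like, assuming a hypothetical dataset of customer records; the field names and checks are illustrative, not from any particular vendor or project:

```python
# A toy "fit for purpose" audit: count rows with missing required fields
# and rows that reuse an ID. Field names here are hypothetical examples.
def audit_rows(rows, required_fields=("customer_id", "email", "signup_date")):
    issues = {"missing_field": 0, "duplicate_id": 0}
    seen_ids = set()
    for row in rows:
        if any(not row.get(field) for field in required_fields):
            issues["missing_field"] += 1
        if row.get("customer_id") in seen_ids:
            issues["duplicate_id"] += 1
        seen_ids.add(row.get("customer_id"))
    return issues

rows = [
    {"customer_id": "1", "email": "a@example.com", "signup_date": "2023-01-01"},
    {"customer_id": "1", "email": "b@example.com", "signup_date": "2023-01-02"},  # reused ID
    {"customer_id": "2", "email": "", "signup_date": "2023-01-03"},               # missing email
]
print(audit_rows(rows))  # {'missing_field': 1, 'duplicate_id': 1}
```

Real audits go much further – value ranges, label balance, provenance – but even counts like these catch problems before they reach a model.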
Content moderation is hard. There’s just no two ways about it.
Doing content moderation at scale is even tougher because it requires some amount of technology – be it a rules engine, custom software, or an AI model.
The catch? Machines work best with firm, unambiguous constructs, while human speech is … well … we’re kind of all over the place.
I’m thinking about this again in relation to a recent move by Threads to block Covid-related search terms:
“Threads blocks searches related to covid and vaccines as cases rise” (Washington Post)
One point in that article really stood out to me. It’s in this excerpt:
“All this talk about AI and large language models and all these amazing technological innovations,” he said, “and one of the top tech companies in the world is resorting to these really crude instruments for content moderation.” […]
Blocking certain words from search outright is also ultimately ineffective, Farid said, because users will quickly develop euphemisms and turns of phrase to get around them.
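Farid’s point about euphemisms is easy to demonstrate. Here’s a toy keyword filter – the blocked terms are my own hypothetical examples, not Threads’ actual list – showing how trivially users route around exact-match blocking:

```python
# Toy blocklist filter: exact word matching, the "crude instrument" in question.
BLOCKED_TERMS = {"covid", "vaccine"}  # hypothetical examples, not Threads' list

def is_blocked(query: str) -> bool:
    words = query.lower().split()
    return any(term in words for term in BLOCKED_TERMS)

print(is_blocked("covid vaccine news"))  # True  -- caught
print(is_blocked("vax side effects"))    # False -- euphemism slips through
```

The moment users switch to “vax”, “c0vid”, or any other variant, the filter is blind – which is exactly why blocklists become a game of whack-a-mole.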
Most people outside of the tech space don’t realize this, but “search” and “AI” have a lot in common. A lot.
Both are based on expressing a real-world concept as a series of numbers (a vector), then letting a computer compare tons of vectors in search of patterns and similarity.
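To make “comparing vectors” concrete, here’s a minimal sketch using cosine similarity, a common way to measure how close two vectors point. The three-dimensional “embeddings” below are made-up toy values – real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Score two vectors by the angle between them: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" -- purely illustrative values.
vaccine = [0.9, 0.1, 0.2]
vax     = [0.85, 0.15, 0.25]  # a euphemism lands near the original concept
weather = [0.1, 0.9, 0.3]

print(cosine_similarity(vaccine, vax))      # close to 1.0 -- similar concepts
print(cosine_similarity(vaccine, weather))  # much lower -- unrelated concepts
```

Whether the application is labeled “search” or “AI,” this is the underlying move: turn concepts into vectors, then measure distance between them.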
Point being: “AI” may be considered the fancier, more advanced tool. But if a given problem trips up search, it’ll probably trip up AI. I don’t think Threads using AI-based content moderation would have done much better here than the search-based approach.
Interesting use of AI for translation: besides translating a film character’s words, update their lip movements to match:
“Meine Kopie spricht sieben Sprachen” (“My copy speaks seven languages”) (Die Zeit)