What if the AI driving your business operations was built on stolen content? In a startling revelation, Meta's AI has come under fire for allegedly using millions of pirated books to train its large language models (LLMs). This controversy highlights the risks and ethical dilemmas businesses face when deploying AI-driven solutions. Unsealed court documents reveal that Meta's LLMs were reportedly trained on Books3, an illicit dataset comprising copyrighted works from celebrated authors like Stephen King and Margaret Atwood, raising significant concerns about copyright infringement and trust.
The implications for companies relying on AI are profound. While Meta may not have directly committed the act of piracy, using such questionable data underscores the deeper issue of accountability in AI training. For businesses, this revelation is a crucial reminder to scrutinize their AI tools’ origins and ensure compliance with intellectual property laws. As AI models become more ubiquitous in customer service, content generation, and HR operations, the need for transparency and ethical considerations increases to prevent legal fallout and reputational damage.
As companies navigate the AI landscape, understanding the source of data used in AI models is paramount, especially in regulated sectors like healthcare and finance. This incident serves as a call to action for organizations to implement robust AI governance policies, demand vendor transparency, and push for ethical AI practices. Emphasizing trust and transparency in AI development not only safeguards business operations but also positions companies strategically as regulatory landscapes evolve and ethical standards tighten.
Links & Resources:
Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal – https://www.wired.com/story/new-documents-unredacted-meta-copyright-ai-lawsuit/
Watch the full episode:
AI Generated Full Transcript:
Brad Owens (00:00)
What if the AI that was driving your business was built on stolen content? We’re talking millions of pirated books that could be powering the LLM that you’re using.
Jennifer Owens (00:09)
So this is exactly what happened with Meta’s AI according to court documents that were unsealed, I think in January. But what this really is is a wake up call for every company deploying AI-driven labor.
Brad Owens (00:19)
Yeah, because it’s not about copyright things. Yes, that’s bad. But this is about more like trust and risk and really what happens when your AI vendor takes some ethical shortcuts.
Jennifer Owens (00:28)
Yeah,
so we’re gonna dive into this today on Digital Labor Lab. We’re gonna break down the scandal, we’re gonna break down what it means for the future of digital labor, and if we’re lucky, we’ll avoid me hopping on my soapbox to talk about authors and their intellectual property rights.
Welcome to Digital Labor Lab, where we are exploring the future of work one experiment at a time. I’m Jenny Owens.
Brad Owens (00:56)
And I'm Brad Owens. So let's get you all on the same playing field as we are, just to understand what happened. So there is a case that was filed against Meta by several different authors. Court documents showed that Meta used a data set that's known as, what was this called, Jenny? Books3. Right. So there we go. And it was used to train that Llama AI model.
Jennifer Owens (01:13)
Books3, I believe. It's part of the LibGen database.
Mm-hmm.
Brad Owens (01:22)
So this data set contains something like over 190,000 books. Most of them were pirated, and many of those books are still under copyright.
Jennifer Owens (01:32)
Yeah, and just to set the stage, this is like minor, barely known authors like Stephen King, Margaret Atwood, Zadie Smith, Colson Whitehead. Like these are major contemporary writers, right? This is not training on Project Gutenberg, where stuff is in the public domain. This is living authors whose work is still protected by copyright. And they had their work ingested into that machine learning system without their consent, without compensation, and without any real oversight. And I just want to pause here for a second because
One of the things that is interesting to me about this is that Meta used a pirated database for the training. So the actual act of piracy, the crime committed where we’re stripping the protections out of like an ebook and then loading that up was done by somebody else, but Meta still used the products of that crime to train their machine learning. Yeah.
Brad Owens (02:19)
And they didn’t disclose this,
which is a big problem. It came out kind of during discovery for this trial. And let's kind of move beyond it. This is not just about taking pirated data and using it, because it's possible that Meta didn't even know that they were using that, because what they were training their actual model on was the internet. So if something exists out there on the internet, it's likely going to be ingested into one of these LLMs. And that could be copyrighted material or not copyrighted material.
So it's not truly about just this one case. What it brought up for us, what it got us thinking about, is: wait, what does this actually mean for companies when their AI agents are going to take action from, I guess that's the right word there, all of that copyrighted data? And now you've got your customer service, your content generation, your HR operations that are now interacting with your customers using pirated data.
Jennifer Owens (03:01)
Mm-hmm.
Mm-hmm. It's a huge red flag. I think we're not just talking about biased data or inaccurate outputs. We're talking about data that's been stolen. And so I'm thinking about all the papers that came out that were comparing, like, the Llama model versus this model versus that model. And part of that performance is based on data that could now be triggering lawsuits, or fines, or even just a PR firestorm, right? Like you can have some serious reputational loss if it comes out that I trained my chat bot on The Handmaid's Tale and Cujo and all this stuff. So I wanna take a minute and zoom out a little bit. The whole crux of this podcast, behind digital labor, is using AI agents and automation to improve efficiency, to scale, to reduce our costs.
But if the systems that we’re relying on are trained on illegal or ethically gray content, then I think we need to be really honest and open about what it is that we’re building with.
Brad Owens (04:15)
So think about it: if you are using AI that's summarizing or generating content that's, you know, unknowingly coming from copyrighted work, that could really expose your company to legal risks. And even if you never touch that data set, you could be held liable for using a model that was trained irresponsibly. Think about it like knowingly accepting stolen goods. If you showed up to a meet and greet for Facebook Marketplace and you were buying something and you're like, this kind of seems fishy, seems like a good deal I'm getting here... surprise, that was stolen stuff, you committed a crime. Same sort of thing here. We're just talking the AI version.
Jennifer Owens (04:53)
So,
as we were discussing this podcast concept, we were talking about how if I rob a bank and I leave the money on the corner, like taking that money is still a crime. So, and I think just to extend your metaphor, right? If I go on Facebook marketplace and I see this amazing deal on the Nintendo Switch 2 for like 75 bucks and I go and I purchase that and it turns out that it is stolen, what happens to my Switch 2 that I just purchased? The police take it, right? Like I don’t get to retain that. So.
I think it’s worth thinking about the source of the data that your AI models are trained on, especially in highly regulated sectors like healthcare, education, finance, law, where that intellectual property and the compliance with those regulations is really tightly, tightly controlled. So like for example, if I’m trying to put a model into a healthcare ecosystem that was trained on copyrighted textbooks or proprietary research, boy, that would make me really nervous. Like it’s making my palms sweat just to even think about it.
Brad Owens (05:52)
And this fuels the arms race for all of these companies that are trying to play into this zero-sum model game, right? They're trying to think, we're going to build the best model. It's going to be the thing that everyone bases all of their other potential agents on. It's going to be incredible. But if companies like Meta are gaining an advantage, and maybe it's performance-wise, maybe it's output-wise, whatever they're actually trying to play towards, through training on all of this copyrighted data, or we'll just say bad data, just for the sake of argument here, maybe that creates an unfair advantage for them. Startups, someone who is trying to be an ethical company, who doesn't want to take that route, may end up falling behind because all of these other companies are just like, well, the content was out there, I was just using it. I'm sorry they didn't password protect their data.
Jennifer Owens (06:37)
Mm-hmm.
I mean, they did. It's just that somebody took it off. So I think that the thing that is really getting me, the real kicker, is that there's no regulation yet. I feel like regulation is coming. We've covered this in previous episodes, that we're not seeing a strong federal direction on this in the States. But the EU has the EU AI Act. And the states are starting to build their own legislation, some of which is interesting and some of which is really going to hamper innovation.
But the other flavor now that is coming into play here is lawsuits, right? Do we really think that Stephen King is going to sit back and let people steal The Shining? The Shining is a masterpiece. It is a masterpiece, and it should not be used for free and without credit in training these models. So we're going to see additional regulation, but we're also going to see a lot of lawsuits. It's really going to impact how you use these tools.
Brad Owens (07:36)
It’s gonna be a good time to be an AI lawyer.
Jennifer Owens (07:38)
It would be a great time to be an AI lawyer. AI lawyers, call us. We want to talk. Yeah.
Brad Owens (07:42)
Yeah, absolutely we
do. So then, but yes, all of this happened. It’s not surprising that it happened. We have these gigantic models that are trained on the internet. There’s a lot of good and bad things out there on the internet.
Jennifer Owens (07:55)
My college blog is still out there, guys. You don’t wanna incorporate that.
Brad Owens (07:59)
Ooh, $10 to the first person that could find that blog.
Jennifer Owens (08:01)
No, it’s not hard to find.
No, do not put that out there.
Brad Owens (08:06)
All right. So what should leaders do then? How do we, as business leaders, take this thing that's happening over here and try and understand how we should then be adjusting maybe our AI policies or what we're doing with AI in our business? What it comes down to really is you have to dig deep into the tools that you're using. Don't just ask, hey, what can this AI do? Really look at it and say, what is this AI trained on?
Jennifer Owens (08:35)
Mm-hmm.
Brad Owens (08:35)
So there's kind of three questions that you can really ask of whatever technology you're using, of how you're using this. One: was this model trained using licensed or public domain content? That's the easy one. If we were trying to create a chat bot, we don't want that chat bot to start spouting off Romeo and Juliet or something that's actually in, sorry. Yeah, that's out. Is that actually currently licensed? Yeah, no, I think that's, that's gone.
Jennifer Owens (08:56)
public domain.
The Shakespeare estate is not suing people over that.
Brad Owens (09:05)
Two: has the vendor provided you a data lineage? Can they show a provenance statement of, here's where all of this came from? And then three, and you know, this is something you could do before you sign on with a potential technology provider: are there known lawsuits already involving this company or that model or the data set? That's something you could actually look up.
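If you want to turn those three questions into something your procurement or legal team can track for every tool, a minimal sketch might look like the Python below. The vendor and model names are hypothetical, and the fields are just one illustrative way to record the answers, not a formal standard:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VendorAIDueDiligence:
    """Records the three vendor-vetting questions from this episode for one AI tool."""
    vendor: str
    model_name: str
    # 1. Was the model trained using licensed or public-domain content?
    training_data_licensed: Optional[bool] = None  # None = vendor has not answered yet
    # 2. Has the vendor provided a data lineage / provenance statement?
    provenance_statement_provided: Optional[bool] = None
    # 3. Are there known lawsuits involving this company, model, or data set?
    known_lawsuits: list[str] = field(default_factory=list)
    notes: str = ""

    def red_flags(self) -> list[str]:
        """Return unanswered questions and warning signs to raise before signing."""
        flags = []
        if self.training_data_licensed is None:
            flags.append("Vendor has not confirmed the training data was licensed or public domain.")
        elif self.training_data_licensed is False:
            flags.append("Training data includes unlicensed, copyrighted material.")
        if not self.provenance_statement_provided:
            flags.append("No data lineage or provenance statement provided.")
        if self.known_lawsuits:
            flags.append("Known lawsuits: " + ", ".join(self.known_lawsuits))
        return flags

# Example: what you know before a contract review (hypothetical vendor and model)
record = VendorAIDueDiligence(
    vendor="ExampleAI Inc.",
    model_name="example-llm-v2",
    provenance_statement_provided=False,
)
for flag in record.red_flags():
    print("RED FLAG:", flag)
```

The point is less the code than the habit: every answer gets written down, and anything left blank is treated as a red flag rather than a shrug.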
Jennifer Owens (09:27)
Yep, it’s true. It’s great due diligence, as tempting as it is to stick your head in the sand and be like, la la, they say it’s going to help me. I don’t want to think about how it was built. The other thing I would add is that building your internal governance will also build these skills to be asking these questions and thinking about these kinds of topics in a way that benefits your business. So you can create an AI review board or assign responsibility to a group of people in compliance or in legal or in other areas.
to vet AI vendors that your teams are using or even thinking about using. I would love it if we could govern both before the moment of deployment and then after.
Brad Owens (10:03)
And this is a little bit more technical, but if you're developing your own models, just be transparent. Use synthetic data, all completely made up stuff; AI is really good at anticipating what should be in those data sets, so if you're using a synthetic data set, just be careful with it. Or license specific data sets. There are people out there that truly have data sets that you can license to use to train your model.
Or you can participate in other open-source initiatives, where the data sources are very clearly documented. They're out there in the open: here's what we actually used. Start using more responsible data.
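If you go that documented-data route, one lightweight way to keep sources "out in the open" is a provenance manifest published alongside the model. Here's a minimal sketch; the source names, license terms, and file name are all hypothetical examples, not a prescribed format:

```python
import json

# Each entry documents one source that went into a training set, so anyone
# downstream can see what was used and under what terms.
training_data_manifest = [
    {
        "name": "internal-support-tickets-2023",   # hypothetical proprietary data
        "type": "proprietary",
        "license": "company-owned; customer consent per terms of service",
        "obtained_via": "internal export with PII removed",
    },
    {
        "name": "licensed-news-corpus",            # hypothetical licensed data set
        "type": "licensed",
        "license": "commercial training license, renewed annually",
        "obtained_via": "signed agreement with the publisher",
    },
    {
        "name": "synthetic-dialogues-v1",          # hypothetical synthetic data
        "type": "synthetic",
        "license": "generated in-house",
        "obtained_via": "template-based generation, then human review",
    },
]

# Publish the manifest next to the model so the data sources are documented in the open.
with open("training_data_manifest.json", "w") as f:
    json.dump(training_data_manifest, f, indent=2)
```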
Jennifer Owens (10:37)
Yeah. And another thing you can do is you can really push for vendor accountability. For example, in health care, the Coalition for Health AI has these model cards that demonstrate, here's what the model is intended to do, here's the data on which it was trained. And you can ask for those kinds of things, for a statement of how the model was trained, or you can ask them for a third-party audit. I think we're going to see a big uptick in AI certification. Kind of like we see SOC 2 and other kinds of security documentation, we're going to see a huge uptick in AI documentation. If your vendor isn't willing to work with you on that, it doesn't have to be formal, just a discussion. If they're not willing to open even that tiniest crack into their process, that's your signal to walk away.
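As a rough illustration of what to ask for in writing, here's a minimal model-card-style disclosure request. This is not the Coalition for Health AI's actual card format; the questions below are just examples of the kind of answers a vendor should be able to provide:

```python
# Questions a vendor could be asked to answer in writing before deployment.
MODEL_CARD_REQUEST = {
    "intended_use": "What is the model designed and approved to do?",
    "training_data_sources": "What data sets were used, and under what licenses?",
    "evaluation": "How was performance measured, and on what populations or domains?",
    "known_limitations": "Where does the model fail or degrade?",
    "third_party_audit": "Has an independent audit or certification been performed?",
}

def missing_answers(vendor_response: dict) -> list[str]:
    """List the questions the vendor left blank; any gap is a reason to dig deeper."""
    return [q for q in MODEL_CARD_REQUEST if not vendor_response.get(q)]

# Example: a vendor that only answered the intended-use question
response = {"intended_use": "Customer-service summarization"}
print(missing_answers(response))
# -> ['training_data_sources', 'evaluation', 'known_limitations', 'third_party_audit']
```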
Brad Owens (11:22)
You wouldn't trust third-party software in your enterprise if they didn't give you very clear terms of service, right? If they weren't actually giving you the what behind what they're actually doing. So just treat AI the exact same way. This is just another technology that you're adding to your enterprise software landscape. It's not just the tech that you're adding, though; when it comes to this stuff, you're adding in liability.
Jennifer Owens (11:47)
Yep, every new tool comes with risk, right? And as more AI systems start generating content for your business, whether they’re generating, you know, like legal contracts or, you know, like summaries of things, you need to be confident that your tools aren’t built on a foundation of stolen intellectual property.
Brad Owens (12:01)
Let's wrap all this up then. So this Meta story, it's not just about a big company getting caught, right? Yes, that's going to make headlines, and my gosh, Meta did this thing. But what this is really doing is it's exposing more diligence that we need to do as users of AI, as the organization that is using these foundational models to power the rest of the things going on in your business. It's just kind of putting a gigantic flashlight on this of, man, we might want to pay attention to a little bit more than just, hey, this tool can do this cool stuff.
Jennifer Owens (12:33)
Yeah, so our message and our position is very simple. Ethical AI is strategic AI. The faster you get ahead of this, the better positioned you are when regulations and when lawsuits hit and when your customers also start asking really tough questions.
Brad Owens (12:47)
Yeah. It’s your moment to be able to lead responsibly. That’s what you want to be known for as an organization. And that’s how we’re trying to help frame this up for you. So your competitors, they might cut corners and they may do some things faster, but in the long run, trust and transparency are going to win this.
Jennifer Owens (13:04)
So we’ll be watching in this space to see how this unfolds and sharing updates as new AI governance models and new lawsuits and new disclosures come out. Please do be sure to subscribe wherever you get your podcasts. We’re also on LinkedIn and on YouTube. And you can visit digitallabourlab.com for a full rundown of the content of this episode, as well as a really cool checklist that you can check out.
Brad Owens (13:28)
Yeah. So if you found this valuable, please share this out with your CTO, your general counsel, your procurement lead who’s finding all of your AI tech. This affects everyone that’s in that. So we’ll catch you on the next episode. Thanks so much for watching.
Jennifer Owens (13:41)
Thank you.