Are your proprietary data and trade secrets at risk of becoming someone else’s AI training fodder? In the latest episode of Digital Labor Lab, hosts Brad Owens and Jennifer Owens dive deep into safeguarding your intellectual property against the ever-expanding appetite of AI. They address the crucial aspects of integrating security and privacy into your digital labor systems, discussing on-prem AI model deployment, private cloud solutions, and strategies for maintaining full control over your data.
As AI tools grow increasingly indispensable across industries from healthcare to finance, your internal documents, customer records, and strategy decks represent the crown jewels of your organization’s intellectual property. The discussion highlights the importance of building trusted digital labor systems that prevent your data from being unknowingly siphoned off to fine-tune external models. With major companies like Samsung, Apple, and JPMorgan placing restrictions on AI usage over similar concerns, the episode emphasizes constructing robust security frameworks and a secure architecture tailored to your organization’s needs.
From deploying AI models in-house to exploring the potential of locally fine-tuned open source models, such as Mistral and Falcon, the conversation explores practical steps to protect your sensitive information. The episode also covers the value of sandbox training environments and synthetic data sets to enhance security and operational efficiency. By incorporating audit trails, governance committees, and zero-trust architecture, organizations can safely harness digital labor and ensure compliance and ethical use of AI technologies.
Links & Resources:
Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal – https://www.wired.com/story/new-documents-unredacted-meta-copyright-ai-lawsuit/
Watch the full episode:
AI Generated Full Transcript:
Brad Owens (00:00)
Last week, we uncovered how Meta’s AI was trained on a massive, massive trove of pirated books. But the question we took away from that is, how then do you protect your own proprietary data from becoming someone else’s training set?
Jennifer Owens (00:14)
So on this episode of Digital Labor Lab, we’re going to talk strategies, on-prem AI, private model training, and how to bake security and privacy into your digital labor stack by design.
Brad Owens (00:33)
Welcome to Digital Labor Lab, where we explore the future of work one experiment at a time. I’m one of your hosts, Brad Owens. So whether you are a healthcare system, a law firm, or a product company, your data is your IP: your internal documents, your customer records, the research that you’re doing, your strategy decks, you know, all of those
Jennifer Owens (00:34)
I’m your other host, Jenny Owens.
Brad Owens (00:58)
are your crown jewels. These AI tools that you’re using, though, they’re hungry for data.
Jennifer Owens (01:00)
And if.
Yes, they are. So just a quick note, IP is intellectual property, for people who are not in this space all day, every day. I’m going to try to alternate between the two; I just want to clarify our acronyms. If you’re using cloud-hosted or public large language model APIs without really strict controls, your proprietary information could be cached, stored, or used to fine-tune someone else’s model, often without you knowing. They could be profiting off of your hard work, and you would have no way of knowing.
Brad Owens (01:33)
Yeah, we’re seeing headlines where companies are banning AI tools outright because of exactly that. We’ve got Samsung, Apple, JPMorgan. There’s a lot, but blocking AI is not going to be the answer. People are going to find a way to use AI. So building out trusted digital labor systems, that’s what you can really do to protect yourself into the future.
Jennifer Owens (01:55)
Yeah, and that starts with knowing where your AI lives, knowing what it’s trained on, and knowing how your data flows into and out of it. So we’re gonna spend some time breaking down how you can really build some of that trust and reliability in your own organization. So first of all, let’s talk architecture. One of the strongest moves you can make is deploying AI models on-prem in your own systems. That means the model is running inside your firewall, on your servers, on your terms.
subject to all of your company’s technical controls.
Brad Owens (02:24)
Yeah. And that’s kind of the ideal world, right? No data is going to leave your network at all. No cloud vendor with some kind of fuzzy privacy policy is going to take all of that data. It’s just complete control. However, we also understand that AI models right now take a whole lot of firepower, and there are not a lot of companies that have racks of graphics processors sitting around to run their own local AI. Things are changing so fast that that’s going to be out of date in like a week.
So it’s not realistic right now to say, hey, host it all yourself, you’re going to be an AI company now. We recognize that’s really not something most companies can do.
Jennifer Owens (03:03)
So not today, as of this recording, right? But look at how DeepSeek made so many waves just a few weeks ago. The reason it made so many waves is that it performed really well on a lot of performance benchmarks without requiring the same processing power as a lot of the other models. That said, DeepSeek also kind of failed on a lot of security points, so I wouldn’t take that as your gold standard.
But I do think that we’re seeing efficiency as a key competitive mechanism in the large language model market, which is really interesting. The open-source models like Mistral or Falcon can be fine-tuned locally. You can also fine-tune DeepSeek locally if you want to do a local install. You can get those powerful generative AI capabilities while keeping your sensitive data within your own ecosystem.
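To make that concrete, here is a minimal sketch of what a fully local setup can look like, using the Hugging Face Transformers library. The model ID and prompt are illustrative placeholders, not a recommendation from the hosts.

```python
# Minimal sketch: running an open-weight model entirely inside your own environment,
# so prompts and outputs never leave your network. The model ID and prompt are
# illustrative; swap in whichever open-source model (Mistral, Falcon, etc.) you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # downloaded once, then cached locally

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # spreads the model across local GPUs/CPU; needs the accelerate package
)

prompt = "Summarize our internal onboarding policy in three bullet points."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation happens on your own hardware; nothing is sent to an external API.
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```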
Brad Owens (03:50)
Yeah. And GPU advances keep coming. Like we just saw, Nvidia came out with a completely new chip last week, and that’s allowing even midsize companies to run these models locally and really make sure that they have secure private clouds to themselves. You know, you don’t need Google’s server farm to do this stuff anymore. You can do a lot of it locally.
Jennifer Owens (03:52)
Yes.
Yeah, and a bonus that really makes my governance-loving heart happy, the local models and the fine-tuning that you do on those, those are auditable. You know exactly what data went into training, and you know how the outputs are generated, which is really crucial for any sort of robust AI governance or monitoring program, or for regulatory compliance if you’re in a highly regulated field like finance or healthcare or education or law or like any other of the million highly regulated fields out there.
Brad Owens (04:37)
So I have to be honest, this goes a little against my shiny, bright, AI-can-do-everything instinct. Ooh, I’m going to play with this. Ooh, I’m going to play with this. Ooh, I’m going to play with this. It goes against my heart because I really like playing with the cool flashy stuff and all the new things, but I’m also not putting my entire corporation’s data on the internet to do that. I’m just having fun with, oh look, you can spell strawberry now. The little things. Well, it can now. It did once, at least. So.
Jennifer Owens (05:00)
I can’t it.
Brad Owens (05:06)
Let’s dig into at least how companies should be thinking about this when they’re training their models, like how to actually not expose all your intellectual property. So you start with your infrastructure. You have all of that locked down. You think, okay, no one has access to all of our data. That’s great. But when you start bringing model training into your organization, you open yourselves up to completely new things that you haven’t really exposed all your data to before. So how then do you customize all of this without compromising what’s possible with AI?
Jennifer Owens (05:38)
Sure, and I’m bringing my bias to this as an output of all of my clinical research training and my role in healthcare, where my first reaction to any sort of like data sharing opportunity is to be like, no, no, you can’t have my data. I’m gonna put it all in this paper format so that nobody can read it, which is not super effective and it’s not a great way to innovate. So I’ve been fighting against that instinct for many years.
But a good way to indulge your conservative instincts while still getting to innovate is to start with something like differential privacy. Differential privacy refers to adding a certain amount of noise or randomness to a data set so that you could pull any real data point out of it without compromising the overall structure of the data set. I think about this like widening the bell curve, right? So think about it this way.
If I’m looking at, I don’t know, the average education level of people in our particular county, that’s going to be skewed, because our county holds a lot of people who hold multiple post-graduate degrees. We have a lot of doctors, a lot of lawyers, a lot of folks who have had post-college education. So can we add a certain level of noise to that and still preserve that bump at the higher level of education, while ensuring that if my neighbor,
who collects degrees because it’s fun, dropped out of the data set, we’re not going to see a sudden impact? That allows you to still query your data set and still preserve all of the useful features of that data set while preserving the privacy of any single individual. Because if we pulled a single individual out of that data set and all of a sudden that number of degrees dropped, we’d know it’s somebody who has five or six degrees in our county, and that’s a smaller list of people.
So you wanna make sure that you’re able to protect the privacy of the individuals who are making up your data while not compromising your ability to work with that data in a way that’s innovative and functional.
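A minimal sketch of the idea Jenny is describing, using the classic Laplace mechanism: calibrated noise is added to an aggregate query so that any one person’s presence or absence barely changes the published answer. The data and numbers are made up for illustration.

```python
# Minimal sketch of the Laplace mechanism for differential privacy.
# Noise calibrated to (sensitivity / epsilon) is added to an aggregate query,
# so removing any single person barely changes the published answer.
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative data: number of post-college degrees held per resident.
degrees = np.array([0, 0, 1, 2, 1, 0, 3, 2, 1, 6])  # one enthusiastic degree collector

def dp_count_above(data, threshold, epsilon=1.0):
    """Differentially private count of residents with more than `threshold` degrees."""
    true_count = int(np.sum(data > threshold))
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon = more noise = stronger privacy, at the cost of accuracy.
print(dp_count_above(degrees, threshold=2, epsilon=0.5))
```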
Brad Owens (07:28)
We’re talking about AI employees here. When we start getting into digital labor, the employees that work for our business all have individual, role-based access to all of our systems. We’ve had this for years, where we say, all right, this system is proprietary to just this, and this is proprietary to just that. We do that through data classification. So we have these policies; maybe, for the sake of argument, it’s just a label that you add to a document: this is an internal document,
this is a publicly available document, or this is a confidential document. You’re adding data classification labels to your individual data, and you’re opening this up to AI the exact same way you’d open it up to an employee. You’re saying that this AI model, this digital labor employee, has access to only this stuff because it’s labeled a certain way.
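Here is a minimal sketch of that idea: classification labels gate what a digital labor agent is allowed to read. The label names, documents, and agent clearances are hypothetical.

```python
# Sketch: gating an AI agent's document access by classification label.
# Labels, documents, and agent clearances here are all hypothetical.
CLASSIFICATION_RANK = {"public": 0, "internal": 1, "confidential": 2}

DOCUMENT_LABELS = {
    "press_release.docx": "public",
    "onboarding_guide.pdf": "internal",
    "q3_strategy_deck.pptx": "confidential",
}

AGENT_CLEARANCE = {
    "doc-summarizer-bot": "internal",       # can see public + internal
    "exec-briefing-agent": "confidential",  # can see everything
}

def agent_can_read(agent: str, document: str) -> bool:
    """Allow access only if the agent's clearance meets the document's label."""
    doc_label = DOCUMENT_LABELS.get(document, "confidential")  # unknown docs default to most restrictive
    clearance = AGENT_CLEARANCE.get(agent, "public")           # unknown agents get least access
    return CLASSIFICATION_RANK[clearance] >= CLASSIFICATION_RANK[doc_label]

assert agent_can_read("doc-summarizer-bot", "onboarding_guide.pdf")
assert not agent_can_read("doc-summarizer-bot", "q3_strategy_deck.pptx")
```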
Jennifer Owens (08:18)
Right, I don’t need access to all of our pharmacy records. I don’t need access to all of our legal documents. I do need access to this particular production environment or that particular sandbox environment, which brings me to my next point, sandbox training environments. Boy, building this into your architecture will give you so much peace of mind. If you have a sandbox where you can do training, so no outbound data traffic, stuff only goes in and stays in.
No external logging tools. So nothing is writing, nothing is reading, nothing that you don’t know about is happening in this sandbox. This is just a sealed lab where your model can learn from your data without spilling it outwards.
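One way to approximate that sealed lab is to run training jobs in a container with networking disabled. This sketch uses the Docker SDK for Python; the image name, training script, and host paths are hypothetical placeholders for your own environment.

```python
# Sketch: a "sealed lab" for model training -- a container with no network access,
# so training data and checkpoints stay inside. Image, script, and paths are
# hypothetical placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    image="internal/finetune-env:latest",       # hypothetical in-house training image
    command="python finetune.py --data /data",  # hypothetical training script
    network_mode="none",                        # no outbound (or inbound) traffic at all
    volumes={
        "/srv/training-data": {"bind": "/data", "mode": "ro"},    # data mounted read-only
        "/srv/model-output": {"bind": "/output", "mode": "rw"},   # checkpoints stay on-prem
    },
    detach=True,
)
print(container.id)
```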
Brad Owens (08:56)
Picture Jenny in her basement laboratory, where she’s doing all her AI experiments. Like, that’s what we’re talking about. We’re talking about a corporate laboratory.
Jennifer Owens (09:05)
Yeah, that’s it. Also, why are you exposing my basement laboratory? That is supposed to be my special space. Yeah, it was labeled. Not for podcast discussion.
Brad Owens (09:11)
Sorry, I didn’t look at the data classification label.
I love it. All right, one extra point I always like to throw out there is synthetic data sets. Coming from my HR world, I never want to put anything out there about any individual that could be exposed, so I’m always a big fan of synthetic data sets wherever possible. There are ones that you can actually license or purchase from other companies, but given our entire discussion, make sure that you know where they got all that data from or how they generated it. Yep.
Jennifer Owens (09:44)
for provenance. Yep.
Brad Owens (09:46)
So it gives you that ability to add in that realism without exposing any real people or trade secrets.
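As a minimal sketch of that approach, the Faker library can generate realistic-looking records that trace back to no real person. The fields shown here are illustrative, not a prescribed schema.

```python
# Sketch: generating synthetic, realistic-looking HR records with the Faker library,
# so nothing traces back to a real employee. Field choices are illustrative.
from faker import Faker

Faker.seed(0)  # reproducible output for testing
fake = Faker()

def synthetic_employee() -> dict:
    return {
        "name": fake.name(),
        "email": fake.company_email(),
        "job_title": fake.job(),
        "hire_date": fake.date_between(start_date="-10y", end_date="today").isoformat(),
        "salary_band": fake.random_element(elements=("B1", "B2", "B3", "B4")),
    }

synthetic_dataset = [synthetic_employee() for _ in range(5)]
for record in synthetic_dataset:
    print(record)
```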
Jennifer Owens (09:54)
Yeah. So now that we’ve got a sense of how we want to build our security and our privacy into our architecture and our training approach, let’s talk about digital labor strategy. If you’re adding AI agents into your workforce, we need to think beyond the tech, right? This is not just a model that we need to train, this is not just a security classification; we’re also building a new layer of infrastructure. So your cybersecurity folks are going to have to secure that just like they would any other mission-critical system.
Or maybe it’s not mission-critical, right? Let’s say you’ve got a digital labor agent who is in charge of maintaining a document library; maybe that’s not mission-critical. I always think about tiers when I think about security. Tier zero is the stuff that must absolutely be on 24/7. Tier one is the stuff that really has to be on, but we can handle a drop of maybe 30 seconds. So think about applying tiering to your AI agents and make sure that your security structure
matches the criticality of that agent.
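A minimal sketch of that tiering idea: assign each agent a criticality tier and derive the required security controls from it. The tier definitions, agent names, and control sets are hypothetical.

```python
# Sketch: tiering digital-labor agents by criticality so security controls match.
# Tier definitions, agent names, and control sets are hypothetical.
from enum import IntEnum

class Tier(IntEnum):
    TIER_0 = 0  # must be on 24/7; strictest controls, continuous monitoring
    TIER_1 = 1  # has to be on, but a brief outage (~30 seconds) is tolerable
    TIER_2 = 2  # nice to have; standard controls

AGENT_TIERS = {
    "payments-reconciliation-agent": Tier.TIER_0,
    "customer-support-triage-agent": Tier.TIER_1,
    "document-library-maintainer": Tier.TIER_2,
}

REQUIRED_CONTROLS = {
    Tier.TIER_0: {"identity_verification", "behavioral_monitoring", "real_time_alerting", "audit_log"},
    Tier.TIER_1: {"identity_verification", "audit_log", "daily_review"},
    Tier.TIER_2: {"audit_log"},
}

for agent, tier in AGENT_TIERS.items():
    print(agent, tier.name, sorted(REQUIRED_CONTROLS[tier]))
```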
Brad Owens (10:55)
So what we’re talking about is security-first digital labor, right? We’re talking about the digital labor version of zero-trust architecture. You start with no one having access, and you only give specific access to specific things or people when they need it. So every AI agent, every automation tool, they should have their own identity verification. How do we know that this is actually the thing that’s supposed to be accessing this data? And how do we make sure that the right data is
restricted so that that thing cannot get access? And then we have behavioral monitoring as well.
Jennifer Owens (11:30)
Yeah. Audit trails are really critical. We do this with humans all the time: we have a log of every login, every prompt, everything. We need to do this for our digital labor as well. You need to log, okay, I prompted the agent with this, these are the actions that were taken, and this is the output that we got. Make sure that you have a log every time they touch a file, every time they make an edit. You would do this for humans; you should do this for your digital labor as well. That’s your paper trail for compliance and for internal and external accountability.
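A minimal sketch of that kind of audit trail: one structured, append-only record per agent action, capturing the prompt, the action taken, the output, and the files touched. Field names and the log path are illustrative.

```python
# Sketch: an append-only audit trail for digital labor -- one structured record
# per agent action, capturing who did what, when, and with which data.
# Field names and the log path are illustrative.
import json
import uuid
from datetime import datetime, timezone

AUDIT_LOG_PATH = "agent_audit.log"

def log_agent_action(agent_id: str, prompt: str, action: str, output: str, files_touched: list):
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "prompt": prompt,
        "action": action,
        "output_summary": output[:500],  # truncate large outputs; full text can live elsewhere
        "files_touched": files_touched,
    }
    with open(AUDIT_LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_agent_action(
    agent_id="doc-summarizer-bot",
    prompt="Summarize the onboarding guide.",
    action="read_document",
    output="Three-bullet summary of the onboarding guide...",
    files_touched=["onboarding_guide.pdf"],
)
```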
Brad Owens (11:59)
And our favorite thing, can’t forget about governance.
Jennifer Owens (12:02)
Our favorite thing? Aww.
Brad Owens (12:04)
I’ve come around to your way of thinking. I want to use all the new shiny tools, but I realize that if it’s not just me coming up with recipes, like our ranch chicken that we came up with a couple of weeks ago, which actually ended up being pretty good, if I’m using this for corporate things, we should probably think about governance. So think about an internal AI ethics and risk committee, define what is acceptable use of all of this data, and, you know,
Jennifer Owens (12:06)
Okay, okay.
That was stellar, yes.
Yes.
Brad Owens (12:30)
you’re not going to be able to restrict every AI tool people use; there’s always going to be something that they find access to. But just make sure you at least have an acceptable use policy. Give them escalation paths, which is a word I can never really say, but it allows them to move things up the chain as needed. And review all new AI deployments just like you would a new hire or a new vendor contract.
Jennifer Owens (12:53)
Yeah, because at the end of the day, these digital labor resources are acting on behalf of your business. They are part of your labor force. You need to treat them like it. You would never hire a new person and just push them into the office and be like, I don’t know, you’ll figure it out. When I started at Cleveland Clinic, I got multiple days of what I lovingly refer to as Cleveland Clinic brainwashing. And it was everything from our mission, our vision, our guiding principles, all the way through to training specific to your job role.
We need to do that for our digital labor assets immediately from day one.
Brad Owens (13:24)
Let’s wrap this all up then. What we are advocating for here is on-prem when possible, or at least a private cloud AI that’s going to give you complete control over how your data is being used. Then fine-tune open-source models with synthetic data or with privacy tools your organization has put in place, so that your data doesn’t get exposed to train the big external models out there. And then build in security from day one. We’re talking governance, we’re
talking monitoring what’s actually happening with these digital employees, and then training all of those things to make sure that they have security in mind from day one.
Jennifer Owens (14:04)
Yeah, the big scandal that Meta AI is currently embroiled in, where it was revealed that their model was trained on copyrighted works, reminded us that AI is really only as ethical and secure as the system behind it. So you cannot, you cannot, cannot, cannot outsource trust. You have to build it in.
Brad Owens (14:20)
So if you want your own quick-start guide to how you can secure your AI deployment, email us at hello at digitallabourlab.com. We’re happy to hook you up with that checklist so that you can make sure that what you’re doing with AI is safe.
Jennifer Owens (14:33)
Yes, if this episode helped you rethink your AI strategy, share it with somebody in your company, or drop us a comment. Give us a subscribe. We are everywhere the finest podcasts are sold. We’re on LinkedIn. We’re on YouTube. And we will see you next time.