The Copyright War Over Training Data: Transformation or Theft?

written by Gülin Alkan ❂ 7 min read time

Every time an AI generates an image, a headline, or even a short essay for you, it isn't creating from nothing. Rather, it draws on millions of works scattered across the internet, often used without permission. Artificial intelligence learns by training on and absorbing every pixel of every image, every word of every text, and any other media it can access. But where does, or should, innovation end and theft begin?

How Copyright Was Born: A Short History

While the tension between innovation and creators' rights may seem like a 21st-century problem, it goes back to the eighteenth century. By 1710, book publishers in London had finally had enough of pirated, unauthorized copies flooding the market and endangering their livelihoods. The solution was revolutionary for its time: the Statute of Anne, the world's first copyright law.

The principle was simple yet elegant: give the creator exclusive rights to their work for a limited period, encouraging them to keep creating while protecting their intellectual labor; when the term expired, the work entered the public domain. This was fair to both sides, balancing each party's interests. For over three centuries, this balance between creator protection and public access has evolved, adapting to new technologies from the printing press all the way to the internet.

Our current copyright laws, while effective for a time, are proving less adequate with each passing day. The concept of "fair use", which allows limited copying when the use is transformative in some way, was designed for a world of human actors making deliberate, small-scale uses of creative works. It assumes someone consciously decides to copy a book, sample a song, or reference a work of art.

The issue starts here: AI training, by nature, breaks all these assumptions. Instead of a human deliberately selecting excerpts from a few sources to be inspired by or to transform, machine learning pipelines automatically scan and process millions of works from across the internet. Every pixel of every image, every word of every article, every note of every song they can access online is scraped at unprecedented scale, and the law is struggling to catch up.


The Battle Lines Are Drawn

What started as online debates on art forums has escalated to federal courtrooms. In February 2025, a Delaware court ruled in Thomson Reuters v. Ross Intelligence that training an AI on copyrighted content can infringe the owner's rights, directly challenging the "fair use" defense that many AI companies have relied upon. The case signaled that courts are starting to catch up to the AI age and take action.

However, the legal picture remains mixed. Some courts have sided with creators, while others have backed AI companies. This patchwork of rulings underscores how uncharted the territory is, leaving us to watch as the legal system learns to keep pace.

Lately, prominent companies have been coming under fire one after another. Anthropic was sued for allegedly training its models on books taken from piracy sites; the case was settled in September 2025 for $1.5 billion, one of the largest copyright settlements in history. Eight major U.S. newspapers sued Microsoft and OpenAI over training data use, and it seems these cases will only become more common over time.

Amid all these cases, regulation is also slowly being introduced: the Generative AI Copyright Disclosure Act of 2024 would require AI companies to disclose the copyrighted works used in their training datasets. Similarly, the EU AI Act requires AI companies to disclose copyrighted training data sources. While these are big steps, it's important to note that they focus on transparency rather than creators' consent or compensation, which should perhaps be the key focus if we want to protect intellectual property and the people behind it going forward.


The Price Tag on Creativity

Three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a class action lawsuit against Midjourney, DeviantArt, and Stability AI, alleging that these companies stole millions of artists' works by training their AI tools on five billion images scraped from the web without the original owners' consent.

The line between "stealing" and "borrowing" is very thin, and Andersen herself acknowledges this, noting that as an artist she has been influenced by anime art styles, characters, and internet culture. This is essentially how humans create: by looking at the world around them and combining what they experience in a melting pot, which is what makes the result transformative. So what's different when a bot, or a generative model, tries to replicate this process?

Many argue the difference lies in the scale and consent of the "inspiration": when a human borrows, it usually happens through deliberate, small-scale choices. When an AI borrows, it happens at industrial scale, and millions (or billions!) of works get scraped without intent or consent.

As these practices become more common, artists are forced to worry not just about their copyrighted material but also about their livelihoods. In the UK, a survey found that 74% of creators are troubled by how their work may be used in AI training without their control, with 93% wanting credit and 94% wanting compensation. In Australia, over 80% of artists believe AI will hurt their income, and nearly three-quarters support a compensation scheme.

Another consequence is already visible: economic fallout, as big companies test how far they can push replacing humans. NCSOFT laid off portions of its art team and invested in AI tools instead. The same happened at Duolingo, the language-learning app, which began replacing human contractors with AI to produce more content in less time. Freelancers on sites like Fiverr and Upwork are reporting fewer gigs because clients can just open Midjourney or DALL·E and get something "good enough" in a few clicks. The "human cost" isn't some abstract problem for later; it's artists right now being told they're too expensive compared to machines trained on their own work.


How to Protect Creators

We've seen how bad this has gotten: millions of creators having their work scraped without permission, and companies building entire empires on that unpaid creative labor. So what can we actually do about it?

While there's no magic solution that will instantly resolve the tension between AI innovation and creators' rights, there are concrete steps we could take that would make a difference instead of letting this continue unchecked.

The Cookie Standards for Your Art

We already have a working model for regulating digital data: cookies. When they were introduced in 1994, cookies were just a practical fix to keep shopping carts and login sessions working on early websites. But by the late '90s, companies realized they could also be used to track users across the web, which quickly spiraled into the surveillance advertising economy. Regulators stepped in during the 2000s, which is why cookie consent banners exist today: websites are now required to disclose their data collection and get user permission before doing it.

Why shouldn't the same system exist for creative work online? Big platforms like Reddit, DeviantArt, and art forums could build an opt-in step into registration, so users are informed and can explicitly consent to their work being used as AI training data. With the scale that LLMs and gen-AI models are reaching, this can't stay a niche concern for the technically savvy anymore. Regulation has to start at least at the level of the major platforms.
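
To make the idea concrete, here's a minimal sketch of what such a consent gate could look like on the platform side. Everything in it is hypothetical (the `UserConsent` model, the field names, the export filter); the point is simply that the mechanism is a small data-model change, not a technical moonshot.

```python
from dataclasses import dataclass

@dataclass
class UserConsent:
    user_id: str
    ai_training_opt_in: bool = False  # default is "no": users must actively opt in

@dataclass
class Work:
    work_id: str
    creator_id: str

def exportable_for_training(works: list[Work], consents: list[UserConsent]) -> list[Work]:
    """Return only the works whose creators explicitly opted in to AI training."""
    opted_in = {c.user_id for c in consents if c.ai_training_opt_in}
    return [w for w in works if w.creator_id in opted_in]
```

The key design choice is the default: consent is off unless the creator flips it, mirroring how cookie banners are supposed to handle tracking.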

Better Tracking with Digital Fingerprinting

Legal consent forms are one thing, but we also need technology that makes them trackable and enforceable. Some services, such as stock photo libraries, subscription platforms, and even some music services, already embed metadata or digital watermarks to track ownership and usage. Extending a similar system to creative works more broadly would allow creators to be identified and compensated, or excluded, when their work is pulled into datasets.
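
As a toy illustration: PNG files already support arbitrary text metadata, which the Pillow library can read and write. The `CreatorID` and `AITraining` keys below are invented for this example; real-world efforts lean on standards like IPTC metadata or C2PA content credentials, and robust watermarks live in the pixels themselves rather than in strippable metadata.

```python
from PIL import Image              # pip install Pillow
from PIL.PngImagePlugin import PngInfo

# Tag a work with an ownership ID and a training-consent flag.
meta = PngInfo()
meta.add_text("CreatorID", "artist-1234")  # hypothetical key: who owns this work
meta.add_text("AITraining", "deny")        # hypothetical key: a "no scrape" flag

img = Image.open("artwork.png")
img.save("artwork_tagged.png", pnginfo=meta)

# Anyone, including a compliant crawler, can read the tags back.
tagged = Image.open("artwork_tagged.png")
print(tagged.text)  # {'CreatorID': 'artist-1234', 'AITraining': 'deny'}
```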

In turn, AI companies could be required to run automated compliance checks that scan works for embedded IDs, record usage automatically, and skip over anything marked with a "no scrape" flag.
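
On the crawler side, such a check could be as simple as reading the flag before a work enters a dataset. A sketch, reusing the hypothetical `AITraining` tag from above:

```python
from pathlib import Path
from PIL import Image

def collect_training_images(root: str) -> list[Path]:
    """Walk a directory and keep only images whose embedded flag permits training."""
    allowed = []
    for path in Path(root).rglob("*.png"):
        with Image.open(path) as img:
            tags = getattr(img, "text", {})  # PNG text chunks; empty if absent
            if tags.get("AITraining") == "deny":
                continue  # respect the "no scrape" flag
            allowed.append(path)
    return allowed
```

Of course, a scraper that simply strips the metadata first defeats this entirely, which is exactly the gap discussed next.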

While this is a possible solution, there are two big challenges ahead. On the legal side, lawmakers would need to ensure companies actually respect these signals. Current mechanisms such as robots.txt (a small file that tells web crawlers which pages to skip) and API restrictions are opt-out systems, and companies can mostly ignore them without legal repercussions, although the EU AI Act has begun making such violations illegal. On the technical side, we'd need to make this metadata hard to strip or alter. Although computationally heavy, that's not impossible.
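
For what it's worth, honoring robots.txt is already trivial for a well-behaved crawler; Python's standard library even ships a parser for it. What's missing is any obligation to run code like this (the crawler name and URLs below are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A compliant crawler checks before every fetch; a non-compliant one just doesn't.
url = "https://example.com/gallery/artwork.png"
if rp.can_fetch("ExampleTrainingBot/1.0", url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt disallows", url, "for this user agent")
```

The gap, in other words, is incentives and enforcement, not engineering.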


Beyond Law and Code

At the end of the day, this goes beyond a technical or legal fight. It's ultimately about the creative culture we want in the 21st century, and about whether we're okay with art made by real people being used as mere training data for GenAI models. Do we really want a future where creativity gets flattened into endless "content", or one where we cherish originality, authenticity, and labor?


SOURCES:

  1. Statute of Anne 1710 | Wikipedia

  2. Thomson Reuters v. Ross Intelligence - Delaware Court Copyright Ruling | Schwabe Law

  3. Anthropic $1.5 Billion Settlement with Authors | NPR

  4. Artists Land Win in Class Action Against AI Companies | Artnet News

  5. Eight Newspapers Sue OpenAI and Microsoft | NPR

  6. Generative AI Copyright Disclosure Act Introduction | ASCAP

  7. EU AI Act Copyright Requirements | Pinsent Masons

  8. UK Creators Survey on AI - DACS | Arts Professional

  9. Australian Artists Survey on AI - NAVA | National Association for Visual Arts


Are you interested in AI safety and want to contribute to the conversation by becoming a writer for our blog? Then send an email to caethel(at)ais-saarland(dot)org with information about yourself, why you're interested, and a writing sample.
