The Great Inversion: How AI Flipped Copyright Inside Out

When I started in content marketing, running a plagiarism check was part of our quality process. In one interview, my manager even rejected a candidate because his assignment showed traces of copied content. In the world of writing, plagiarism has always been the eighth deadly sin.

Protecting original expression sits at the heart of copyright law. But GenAI is inverting that framework in ways we haven’t fully grasped yet.

The Substantial Similarity Test

While tinkering with GenAI tools, I observed that no two pieces of GenAI writing are alike, not even when you use the same prompt and the same tool. Doubt this? Give a fairly generic prompt to a GenAI tool, then repeat the exact prompt in a new window of the same tool. You’ll observe that while the substance might be similar, the wording is rarely the same.
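
If you want to see this for yourself in code, here’s a minimal sketch of the experiment. It assumes the OpenAI Python SDK purely for illustration; the model name and prompt are placeholders, and any chat-style GenAI API would behave the same way.

```python
# A sketch of the repeat-prompt experiment. Assumes the OpenAI
# Python SDK ("pip install openai") with an API key in the
# OPENAI_API_KEY environment variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def generate(prompt: str) -> str:
    """Request a fresh completion for the same prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

prompt = "Write a short paragraph about autumn in a small town."
first = generate(prompt)
second = generate(prompt)

print(first)
print("---")
print(second)  # similar substance, but rarely the same words
```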

In other words, GenAI output will clear the substantial similarity test with flying colors. And what is that test? I first stumbled upon it in Mark Lemley’s paper How Generative AI Turns Copyright Upside Down.

A key method for proving copyright infringement, this test relies on comparing two works to see if the defendant copied the original’s expression.

With GenAI, this test becomes useless because the AI’s output is rarely a copy. The model doesn’t literally reproduce its training data. As we just observed, the probability of an output closely matching any particular source text is abysmally low.
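
To make the point concrete, here’s a toy way to quantify similarity between an AI output and a candidate source, using Python’s standard difflib. The texts are invented examples; real substantial-similarity analysis is far more nuanced than a string-matching ratio.

```python
# A crude stand-in for a substantial-similarity check, using only
# Python's standard library. Courts apply a far more nuanced analysis;
# this just illustrates why near-verbatim matches between an AI output
# and any one source text are so rare.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 ratio of matching character subsequences."""
    return SequenceMatcher(None, a, b).ratio()

original = "The comet blazed over the beach as the sun slipped below the waves."
ai_output = "A bright comet streaked above the shoreline while the sun was setting."

print(f"Similarity: {similarity(original, ai_output):.2f}")
# Paraphrases of the same idea score well below the near-1.0 ratio
# that literal copying would produce.
```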

Significantly, you can’t use the similarity of the final expression to prove that the human copied a particular work, because the AI, not the human, created that expression.

This leads to the second problem Lemley identifies, one that cuts even deeper.

The Idea-Expression Dichotomy

Copyright law has always protected the unique expression of an idea, not the idea itself. It rewarded the hard, creative work of expression: the brushstrokes of a painting or the specific words of a poem.

With GenAI, the human user provides the idea (e.g., “a painting of a comet over a beach at sunset”), while the AI handles the work of expression (e.g., generating the image and choosing its colors and composition).

This moves the creative value from the output to the input. From the expression to the idea.

This creates a new dilemma. Copyright law doesn’t protect ideas, and current U.S. law doesn’t recognize non-human authors. So whose work is the output? And if the human’s contribution is the “idea” (the prompt), can that contribution be protected by copyright at all?

The paper suggests a new approach in which the user’s detailed prompt is treated as the creative act. But that position is legally difficult to defend, as the U.S. Copyright Office’s reluctance to register AI-generated works shows.

Lemley concludes that current copyright laws are ill-equipped to handle GenAI. While his diagnosis is precise, the real-world consequences are messier. Consider what happened at Anthropic.

The Books They Burned

Anthropic purchased millions of physical books, stripped off the bindings, scanned the pages, and threw the originals away. Not digitized and preserved, but digitized and destroyed.

Why? The company wanted “all the books in the world” without the “legal/practice/business slog.” Their workaround was elegantly cynical: buy physical copies, invoke the first-sale doctrine, and shred the evidence.

It worked, at least initially.

William Alsup, a federal judge in San Francisco, ruled that this destructive scanning constituted “fair use” under copyright law. In his view, the AI learned from the books not to copy them but to “create something different,” like “any reader aspiring to be a writer.” He compared destroying the physical books after scanning to “conserving space.”

Does this metaphor of AI as an aspiring writer hold up to scrutiny? Does it adequately recognize the difference between a human learning narrative techniques and a machine processing millions of texts into statistical patterns? That’s a different debate.

Back to our story…

Before adopting this book-shredding approach, the company downloaded over seven million pirated books from sites like LibGen and Books3. That part, the judge ruled, was infringement. Not because of how the books were used, but because of how they were acquired. The first-sale doctrine protects what you do with property you’ve legitimately purchased. It’s not an alibi for theft.

The distinction matters. Anthropic ultimately settled the case for $1.5 billion, roughly $3,000 per book paid to authors. The company’s lawyers had warned that statutory damages for willful infringement could reach “hundreds of billions of dollars,” enough to put the company itself at risk. So they settled.

What the Settlement Actually Means

The Anthropic settlement establishes what Lemley’s paper doesn’t fully address: provenance matters as much as use. Even if AI training constitutes transformative fair use (that question remains unsettled), the law won’t tolerate acquiring training data through piracy, even if the subsequent use would be legal.

Look at how strange this turns out to be. Anthropic can buy a book, destroy it, scan it, feed it to Claude, and call that fair use. But it cannot download that same book from LibGen and do the exact same thing. While companies that scraped pirated content face liability, companies that purchased or licensed their data have legal cover.

The settlement also reveals the gap between legal theory and practical outcomes. Yes, AI companies are winning the fair use argument. But they’re still writing billion-dollar checks to settle cases where they can’t defend their data sources. The law might eventually accommodate AI training, but it’s extracting a toll in the process.

What’s striking is how this sidesteps Lemley’s core concern. Even if every AI company scrupulously purchases its training data, we’re still left with the inversion he describes.

The human contributes the idea, the AI produces the expression, and copyright protects neither. The substantial similarity test remains broken. The legal framework still doesn’t fit.

The Paradox of AI’s Creative Destruction

AI companies are fighting to establish that training on copyrighted works is legal. In doing so, they’re creating a system where the output of that training, the actual creative work, has no legal protection. They’re securing the right to consume the old system while building a new one that makes ownership itself ambiguous.

The books Anthropic shredded are gone. They have now become patterns in Claude’s neural network. These patterns can’t be owned, can’t be traced, and can’t be meaningfully compared to any specific original. That’s precisely the inversion Lemley describes.

And we’re all working inside it now, whether we’ve noticed or not.
