Outcry Grows Over AI Companies and Who Controls Internet’s Content

Websites like Reddit and writers including James Patterson and Sarah Silverman demand compensation for work they suspect was used to train new artificial-intelligence technology

Photo Illustration by Emil Lendof/The Wall Street Journal; iStock

By Deepa Seetharaman and Keach Hagey

July 30, 2023 5:30 am ET

A collective cry is breaking out as authors, artists and internet publishers realize that the generative-AI phenomenon sweeping the globe is built partly on the back of their work.

The emerging awareness has set up a war between the forces behind the inputs and the outputs of these new artificial-intelligence tools, over whether and how content originators should be compensated. The disputes threaten to throw sand into the gears of the AI boom just as it seems poised to revolutionize the global economy.

Artificial-intelligence companies including OpenAI, its backer Microsoft, and Google built generative-AI systems such as ChatGPT by scraping oceans of information from the internet and feeding it into training algorithms that teach the systems to imitate human speech. The companies generally say their data use without compensation is permitted, but they have left the door open to discussing the issue with content creators.

Earlier in July, thousands of authors including Margaret Atwood and James Patterson signed an open letter demanding that top AI companies obtain permission and pay writers for the use of their works to train generative-AI models. Comedian Sarah Silverman and other authors also filed lawsuits against OpenAI and Facebook parent Meta Platforms for allegedly training their AI models on illegal copies of their books that were captured and left on the internet.

News publishers have called the unlicensed use of their content a copyright violation. Some—including Wall Street Journal parent News Corp, Dotdash Meredith owner IAC and publishers of the New Yorker, Rolling Stone and Politico—have discussed with tech companies exploring ways they might be paid for the use of their content in AI training, according to people familiar with the matter.

The Associated Press and OpenAI announced a deal this month for the tech company to license stories in the AP archive.

President Biden announced a deal with tech companies to put more safeguards in AI technologies. “Realizing the promise of AI by managing the risk is going to require some new laws, regulations, and oversight,” said Biden. Photo: Andrew Caballero-Reynolds/AFP/Getty Images

Reddit, the social-discussion and news-aggregation site, has begun charging for some access to its content. Elon Musk has blamed AI companies scraping “vast amounts of data” for a recent decision by X, then called Twitter, to limit the number of tweets some users could view. And striking actors and writers have cited concerns that Hollywood studios could use AI to copy their likenesses or eliminate their jobs.

The escalating tensions reflect a broader rethinking of the value of writing and other online content, and how freely it should be swept up by large tech companies investing heavily in AI technologies that they expect to power future profits.

Patterson, one of the country’s most popular writers, said he found the idea that all of his novels—more than 200 of them—were likely ingested without his permission to train generative-AI software to do his job “frightening.”

“This will not end well for creatives,” he said in an interview.

Books constitute a sizable part of the training data for AI models, but the companies haven’t disclosed all the books their AI systems ingested and whether the list includes any still under copyright. Some authors say they suspect their books were used partly because the models can faithfully recount passages from various chapters. The complaints filed by Silverman and other authors allege that the companies trained their systems on illegal “shadow libraries” containing books under copyright.

OpenAI and Google have both said they train their AI models on “publicly available” information, a phrase that experts say encompasses a spectrum of content, including from paywalled and pirated sites. OpenAI also said in a statement that it respects the rights of creators and authors, and that many creative professionals use ChatGPT.

The fights hold the potential to put new limits or add considerable cost to accessing data that would radically change the business equation for these new AI tools.

The lawsuits could force companies to build licensing into future data-collection practices, or require payment retroactively for copyright material used to train their models. Courts could require deletion of models that were built on top of such data, which would set AI work back years.

Limits on data would challenge how easily AI companies can build future versions of their language models. But the sheer size of these models also is a challenge for those seeking copyright protection, lawyers say.

“The cases are new and dealing with questions of a scale that we haven’t seen before,” said Mehtab Khan, a resident fellow at Yale Law School’s Information Society Project, which researches information law and policy. “The question becomes about feasibility. How will they reach out to every single author?”

The November launch of ChatGPT, with its mix of practical uses and quirky ability to write a script in Woody Allen’s style or explain string theory a la Snoop Dogg, set off an explosion of interest in generative-AI tools, and an arms race among companies.

The power of chatbots such as ChatGPT stems from AI systems known as large language models. Companies can spend tens of millions of dollars or more to train some of the largest models, using data gathered with automated programs that vacuum up information from sites across the internet.

Tech companies have pointed to the legal doctrine of fair use, which permits the use of copyright material without permission under some circumstances, including if the end product is sufficiently different from the original work. AI proponents say free access to information is vital for technology that learns similarly to people and that has huge potential upsides for how we work and live.

“If a person can freely access and learn from information on the internet, I’d like to see AI systems allowed to do the same, and I believe this will benefit society,” said Andrew Ng, who invests in AI companies and runs an AI research lab at Stanford University.

There also is growing concern that AI systems could be used to replace screenwriters, journalists or novelists, who already make less money producing the work than the tech companies stand to make through training on that work.

AI leaders have generally said that while the technology might hurt some professions it will also create new kinds of jobs.

The Authors Guild, which published last week’s letter, has approached tech CEOs to discuss possible payment for training already done and licensing deals for authors that would pay them if they let language models mimic their work. Mary Rasenberger, CEO of the guild, said the conversations have been productive but need participation from all AI firms.

Rasenberger says the issue will persist because the companies need ever more information to advance their AI tools. Their models “aren’t even going to work in the future unless they continuously get fresh material,” she said.

A Google spokeswoman said it is “working to develop a better understanding of the business models for these products and on ways to give web publishers choice and control of their content.” She said Google would prioritize sending “valuable traffic” to news publishers as it develops AI tools.

The complaints and lawsuits in recent weeks build on legal challenges to earlier forms of generative AI that produce imagery and computer code.

In November, for example, a proposed class-action lawsuit was filed against OpenAI and Microsoft, along with its subsidiary GitHub, by lawyers working on behalf of GitHub users. They claimed that GitHub Copilot, a generative-AI tool used by programmers, violated open-source software licenses by reproducing licensed snippets of code without credit.

GitHub said it is committed to innovating responsibly and believes AI will “transform the way the world builds software, leading to increased productivity and most importantly, happier developers.”

Other proposed class-action lawsuits were filed separately against OpenAI, Microsoft and Google on behalf of internet users alleging that the companies’ scraping of websites to train their AI models violated the users’ privacy rights and copyrights.

OpenAI hasn’t revealed much about the data used to train its latest language model, GPT-4, citing competitive concerns. Its prior research papers show that earlier versions of its GPT model were trained in part on English-language Wikipedia pages and data scooped by a nonprofit called Common Crawl. It also trained its software using an OpenAI-compiled corpus of certain Reddit posts that received a user score, or “karma,” of at least three.

In April, Reddit, a key source of data for OpenAI and others building large language models, announced that it would start charging for direct, large-scale data access.

The AI systems “depend entirely on having a data set of quality work, made by humans, and if they collapse that market, their systems are going to collapse too,” said Matthew Butterick, a lawyer representing Sarah Silverman and several other parties suing tech companies over the use of their content to train generative AI. “They can’t bankrupt artists without bankrupting themselves.”
