New York Times-ChatGPT lawsuit poses new legal threats to artificial intelligence

by Julia Shapero - 01/09/24 6:00 AM ET

AP Photo/Mark Lennihan, File
A sign for The New York Times hangs above the entrance to its building, May 6, 2021, in New York.

After a year of explosive growth, generative artificial intelligence (AI) may be facing its most significant legal threat yet from The New York Times.

The Times sued Microsoft and OpenAI, the company behind the popular ChatGPT tool, for copyright infringement shortly before the new year, alleging the companies impermissibly used millions of its articles to train their AI models.

The newspaper joins scores of writers and artists who have sued major technology companies in recent months for training AI on their copyrighted work without permission. Many of these lawsuits have hit road bumps in court.

However, experts believe The Times’s complaint is sharper than earlier AI-related copyright suits.

“I think they have learned from some of the previous losses,” Robert Brauneis, a professor of intellectual property law at the George Washington University Law School, told The Hill.

The Times lawsuit is “a little bit less scattershot in their causes of action,” Brauneis said.

“The attorneys here for the New York Times are careful to avoid just kind of throwing up everything against the wall and seeing what sticks there,” he added. “They’re really concentrated on what they think will stick.”

Transformation vs. Reproduction

Generative AI models require massive amounts of material for training. Large language models, like OpenAI’s ChatGPT and Microsoft’s Copilot, use the material they are trained on to predict what words are likely to follow a string of text to produce human-like responses.

Typically, these AI models are transformative in nature, said Shabbi Khan, co-chair for the Artificial Intelligence, Automation, and Robotics group at law firm Foley & Lardner.

“If you asked it a general query … it doesn’t do a search and find the right passage and just reproduce the passage,” Khan explained. “It will try to probabilistically create its own version of what needs to be said based on a pattern that it picks up through parsing billions of words of content.”

However, in its suit against OpenAI and Microsoft, the Times alleges the AI models developed by the companies have “memorized” and can sometimes reproduce chunks of the newspaper’s articles.

“If individuals can access The Times’s highly valuable content through Defendants’ own products without having to pay for it and without having to navigate through The Times’s paywall, many will likely do so,” the lawsuit reads.

“Defendants’ unlawful conduct threatens to divert readers, including current and potential subscribers, away from The Times, thereby reducing the subscription, advertising, licensing, and affiliate revenues that fund The Times’s ability to continue producing its current level of groundbreaking journalism,” it adds.

In response to the lawsuit, an OpenAI spokesperson said in a statement that the company respects “the rights of content creators and owners” and is “committed to working with them to ensure they benefit from AI technology and new revenue models.”

While a Times spokesperson said the newspaper “recognizes the power and potential of GenAI for the public and for journalism,” they also emphasized that the AI models were built on “independent journalism and content that is only available because we and our peers reported, edited, and fact-checked it at high cost and with considerable expertise.”

“Settled copyright law protects our journalism and content,” the spokesperson added. “If Microsoft and OpenAI want to use our work for commercial purposes, the law requires that they first obtain our permission. They have not done so.”

Brauneis said some of the “most impressive” portions of the Times case are its repeated examples of the AI models simply regurgitating its material, nearly verbatim.

Earlier copyright lawsuits haven’t been able to show such direct reproductions of their material by the models, Khan noted.

In recent months, courts have dismissed claims in similar lawsuits that the outputs of particular AI models infringed on the plaintiffs’ copyright, because the plaintiffs failed to show outputs that were substantially similar to their copyrighted work.

“I think [the Times] did a good job relative to what maybe other complaints have been put out in the past,” Khan told The Hill. “They provided multiple examples of basically snippets and quite frankly more than snippets, passages of the New York Times as reproductions.”

Khan suggested the court could decide that particular use cases of generative AI are not transformative enough and require companies to limit certain prompts or outputs to prevent AI models from reproducing copyrighted content.

While Brauneis similarly noted the issue could result in an injunction against the tech companies or damages for the Times, he also emphasized it is not an unsolvable issue for generative AI.

“I think that the companies will respond to that and develop filters that dramatically reduce the incidence of that kind of output,” he said. “So, I don’t think that’s a long-term, huge problem for these companies.”

In an October response to an inquiry from the U.S. Copyright Office, OpenAI said it had developed measures to reduce the likelihood of “memorization” or verbatim repetition by its AI models, including removing duplications from its training data and teaching its models to decline prompts aimed at reproducing copyrighted works.

The company noted, however, “Because of the multitude of ways a user may ask questions, ChatGPT may not be perfect at understanding and declining every request aimed at getting outputs that may include some part of content the model was trained on.”

The AI model is also equipped with output filters that can block potentially violative content that is generated despite other safeguards, OpenAI said.

OpenAI also emphasized in a statement on Monday that memorization is a “rare bug” and alleged that the Times “intentionally manipulated prompts” in order to get ChatGPT to regurgitate its articles.

“Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” the company said.

“Despite their claims, this misuse is not typical or allowed user activity, and is not a substitute for The New York Times,” it added. “Regardless, we are continually making our systems more resistant to adversarial attacks to regurgitate training data, and have already made much progress in our recent models.”

How the media, AI can shape each other

Carl Szabo, the vice president and general counsel of the tech industry group NetChoice, warned that lawsuits like the Times’s could stifle the industry.

“You’re gonna see a bunch of these efforts to kind of shake down AI developers for money in a way that harms the public, that harms public access to information and kind of undermines the purpose of the Copyright Act, which is to promote human knowledge at the end of the day,” Szabo told The Hill.

Eventually, Khan said he thinks there will be a mechanism in place through which tech companies can obtain licenses to content, such as articles from the Times, for training their AI models.

OpenAI has already struck deals with The Associated Press and Axel Springer — a German media company that owns Politico, Business Insider and other publications — to use their content.

The Times also noted in its lawsuit that it reached out to Microsoft and OpenAI in April to raise intellectual property concerns and the possibility of an agreement, which OpenAI acknowledged in its statement about the case.

“Our ongoing conversations with the New York Times have been productive and moving forward constructively, so we are surprised and disappointed with this development,” a spokesperson said.

The OpenAI spokesperson added that the company is “hopeful that we will find a mutually beneficial way to work together.”

“I think most publishers will adopt that model because it provides for additional revenue to the company,” Khan told The Hill. “And we can see that because New York Times tried to enter into [an agreement]. So, there is a price that they’re willing to accept.”

Updated at 11:28 a.m.