AI models frequently ‘hallucinate’ on legal queries, study finds

by Julia Shapero - 01/11/24 4:08 PM ET

Generative artificial intelligence (AI) models frequently produce false legal information, with so-called “hallucinations” occurring between 69 percent and 88 percent of the time, according to a recent study.

Large language models (LLMs) — generative AI models, like ChatGPT, that are trained to understand and produce human language content — have previously been known to “hallucinate” and generate false information.

However, the “pervasive” nature of legal hallucinations raises “significant concerns” about the reliability of using LLMs in the field, the authors from Stanford University’s Institute for Human-Centered AI and Regulation, Evaluation, and Governance Lab noted in a blog post.

When asked direct, verifiable questions about federal court cases, the study found the model behind ChatGPT, GPT-3.5, hallucinated 69 percent of the time, while Google’s PaLM 2 gave incorrect answers 72 percent of the time and Meta’s Llama 2 offered false information 88 percent of the time.

The models performed worse when asked more complex legal questions, such as the core legal question or central holding of a case, or when asked about case law from lower courts, like district courts.

They also frequently failed to contradict false premises in legal queries and tended to overstate their confidence in their responses, the study found.

“Today, there is much excitement that LLMs will democratize access to justice by providing an easy and low-cost way for members of the public to obtain legal advice,” the authors wrote in the blog post published Thursday. “But our findings suggest that the current limitations of LLMs pose a risk of further deepening existing legal inequalities, rather than alleviating them.”

“Ideally, LLMs would excel at providing localized legal information, effectively correct users on misguided queries, and qualify their responses with appropriate levels of confidence,” they added. “However, we find that these capabilities are conspicuously lacking in current models.”

The consequences of such hallucinations have already been seen in the legal field. A federal judge sanctioned two lawyers in June after one used fake case citations that were generated by ChatGPT.

Michael Cohen, former President Trump’s ex-fixer and personal lawyer, also admitted last month to giving his attorney fake case citations after using Google Bard, which ran on PaLM 2 until recently.

In his annual year-end report, Chief Justice John Roberts warned about the potential drawbacks of using AI in the legal field, even as he suggested that the technology could significantly affect judicial work in the future.

“Any use of AI requires caution and humility,” he noted. “One of AI’s prominent applications made headlines this year for a shortcoming known as ‘hallucination,’ which caused the lawyers using the application to submit briefs with citations to non-existent cases. (Always a bad idea.)”
