Large language models (LLMs) — generative AI models, like ChatGPT, that are trained to understand and produce human language content — have previously been known to “hallucinate” and generate false information.
However, the “pervasive” nature of legal hallucinations raises “significant concerns” about the reliability of using LLMs in the legal field, the authors from Stanford University’s Institute for Human-Centered AI and Regulation, Evaluation, and Governance Lab noted in a blog post.
When asked direct, verifiable questions about federal court cases, GPT-3.5, the model behind ChatGPT, hallucinated 69 percent of the time, the study found.
Google’s PaLM 2 gave incorrect answers to legal queries 72 percent of the time, while Meta’s Llama 2 offered false information 88 percent of the time.
The models performed worse when asked more complex legal questions, such as those about the core legal question or central holding of a case, or when asked about case law from lower courts, like district courts.
They also frequently failed to contradict false premises in legal queries and tended to overstate their confidence in their responses, the study found.
“Today, there is much excitement that LLMs will democratize access to justice by providing an easy and low-cost way for members of the public to obtain legal advice,” the authors wrote in the blog post published Thursday.
“But our findings suggest that the current limitations of LLMs pose a risk of further deepening existing legal inequalities, rather than alleviating them,” they added.
Read more in a full report at TheHill.com.