
Last week, I reported here that researchers at Stanford University planned to augment a study they had released evaluating generative AI legal research tools from LexisNexis and Thomson Reuters, in which they found that the tools deliver hallucinated results more often than the companies say in their marketing of the products.
The study came under criticism for its omission of Thomson Reuters’ generative AI legal research product, AI-Assisted Research in Westlaw Precision. Instead, it reviewed Ask Practical Law AI, a research tool that is limited to content from Practical Law, a collection of how-to guides, templates, checklists, and practical articles.
The authors explained that Thomson Reuters had denied their multiple requests for access to the AI-Assisted Research product.
Shortly after the report came out, Thomson Reuters agreed to give the authors access to the product, and the authors said they would augment their results.
That augmented version of the study has now been released, and the Westlaw product did not fare well.
While the study found that the LexisNexis generative AI product, Lexis+ AI, correctly answered 65% of their queries, Westlaw’s AI-Assisted Research was accurate only 42% of the time.
Worse, Westlaw was found to hallucinate at nearly twice the rate of the LexisNexis product: 33% for Westlaw compared to 17% for Lexis+ AI.
“On the positive side, these systems are less prone to hallucination than GPT-4, but users of these products must remain cautious about relying on their outputs,” the study says.
One Reason: Longer Answers
One reason Westlaw hallucinates at a higher rate than Lexis is that it generates the longest answers of any of the products they tested, the authors speculate. Its answers average a length of 350 words, compared to 219 for Lexis and 175 for Practical Law.
“With longer answers, Westlaw contains more falsifiable propositions and therefore has a greater chance of containing at least one hallucination,” the study says. “Lengthier answers also require substantially more time to check, verify, and validate, as every proposition and citation has to be independently evaluated.”
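To put that reasoning in concrete terms, consider a rough illustration (the per-proposition error rate here is hypothetical, not a figure from the study): if each factual proposition in an answer independently had a 5% chance of being wrong, an answer containing 10 propositions would include at least one error roughly 40% of the time (1 − 0.95^10 ≈ 0.40), while a five-proposition answer would do so only about 23% of the time (1 − 0.95^5 ≈ 0.23).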
That said, the study found that both Westlaw and Lexis “often offer high-quality responses.” It cited the examples of Lexis picking up a false premise in a question and clarifying the actual law and of Westlaw correctly understanding a nuanced question of patent law.
Still, the study expresses concern about the shortcomings in these systems and what that means for legal professionals:
“These systems can be quite helpful when they work. But as we now illustrate in detail, their answers are often significantly flawed. We find that these systems continue to struggle with elementary legal comprehension: describing the holding of a case …, distinguishing between legal actors (e.g., between the arguments of a litigant and the holding of the court), and respecting the hierarchy of legal authority. Identifying these misunderstandings often requires close analysis of cited sources. These vulnerabilities remain problematic for AI adoption in a profession that requires precision, clarity, and fidelity.”
In this revised report, as they did in the original version, the authors stress that they believe the most important takeaway from the study “is the need for rigorous, transparent benchmarking and public evaluations of AI tools in law.”
“In contrast to even GPT-4 — not to mention open-source systems like Llama and Mistral — legal AI tools provide no systematic access, publish few details about models, and report no benchmarking results at all,” the study says.
Source: https://www.lawnext.com/2024/06/in-redo-of-its-study-stanford-finds-westlaws-ai-hallucinates-at-double-the-rate-of-lexisnexis.html