Potential and Perils of Large Language Models as Judges of Unstructured Textual Data
January 14, 2025
Authors: Rewina Bedemariam, Natalie Perez, Sreyoshi Bhaduri, Satya Kapoor, Alex Gil, Elizabeth Conjar, Ikkei Itoku, David Theil, Aman Chadha, Naumaan Nayyar
cs.AI
Abstract
Rapid advancements in large language models have unlocked remarkable
capabilities when it comes to processing and summarizing unstructured text
data. This has implications for the analysis of rich, open-ended datasets, such
as survey responses, where LLMs hold the promise of efficiently distilling key
themes and sentiments. However, as organizations increasingly turn to these
powerful AI systems to make sense of textual feedback, a critical question
arises: can we trust LLMs to accurately represent the perspectives contained
within these text-based datasets? While LLMs excel at generating human-like
summaries, there is a risk that their outputs may inadvertently diverge from
the true substance of the original responses. Discrepancies between the
LLM-generated outputs and the actual themes present in the data could lead to
flawed decision-making, with far-reaching consequences for organizations. This
research investigates the effectiveness of LLMs as judge models to evaluate the
thematic alignment of summaries generated by other LLMs. We utilized an
Anthropic Claude model to generate thematic summaries from open-ended survey
responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as
LLM judges. The LLM-as-judge approach was compared to human evaluations using
Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable
alternative to traditional human-centric evaluation methods. Our findings
reveal that while LLMs as judges offer a scalable solution comparable to human
raters, humans may still excel at detecting subtle, context-specific nuances.
This research contributes to the growing body of knowledge on AI-assisted text
analysis. We discuss limitations and provide recommendations for future
research, emphasizing the need for careful consideration when generalizing LLM
judge models across various contexts and use cases.
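As a concrete illustration of the comparison described in the abstract, below is a minimal Python sketch of how LLM-judge ratings might be scored against human ratings using the three agreement statistics the authors name. The rating values are fabricated for illustration, and the sketch assumes the third-party `scikit-learn`, `scipy`, and `krippendorff` packages; it is not the paper's actual evaluation code.

```python
# Illustrative comparison of LLM-judge vs. human ratings using the three
# agreement statistics named in the abstract. All scores below are
# fabricated example data, not results from the paper.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score
import krippendorff  # pip install krippendorff

# Hypothetical ordinal alignment scores (1-5) for the same ten summaries.
human_scores = np.array([5, 4, 4, 3, 5, 2, 4, 3, 5, 4])
llm_judge_scores = np.array([5, 4, 3, 3, 5, 2, 4, 4, 5, 4])

# Cohen's kappa: chance-corrected agreement between the two raters.
kappa = cohen_kappa_score(human_scores, llm_judge_scores)

# Spearman's rho: rank correlation between the two rating sequences.
rho, p_value = spearmanr(human_scores, llm_judge_scores)

# Krippendorff's alpha: reliability over a (raters x units) matrix;
# the ordinal level of measurement suits 1-5 rating scales.
alpha = krippendorff.alpha(
    reliability_data=np.vstack([human_scores, llm_judge_scores]),
    level_of_measurement="ordinal",
)

print(f"Cohen's kappa:        {kappa:.3f}")
print(f"Spearman's rho:       {rho:.3f} (p={p_value:.3f})")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

Reporting all three statistics is useful because they answer slightly different questions: kappa measures exact chance-corrected agreement, rho tolerates monotonic shifts in how strictly a judge scores, and alpha generalizes to more than two raters and missing ratings.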