Potential and Perils of Large Language Models as Judges of Unstructured Textual Data
January 14, 2025
Authors: Rewina Bedemariam, Natalie Perez, Sreyoshi Bhaduri, Satya Kapoor, Alex Gil, Elizabeth Conjar, Ikkei Itoku, David Theil, Aman Chadha, Naumaan Nayyar
cs.AI
Abstract
Rapid advancements in large language models have unlocked remarkable
capabilities when it comes to processing and summarizing unstructured text
data. This has implications for the analysis of rich, open-ended datasets, such
as survey responses, where LLMs hold the promise of efficiently distilling key
themes and sentiments. However, as organizations increasingly turn to these
powerful AI systems to make sense of textual feedback, a critical question
arises: can we trust LLMs to accurately represent the perspectives contained
within these text-based datasets? While LLMs excel at generating human-like
summaries, there is a risk that their outputs may inadvertently diverge from
the true substance of the original responses. Discrepancies between the
LLM-generated outputs and the actual themes present in the data could lead to
flawed decision-making, with far-reaching consequences for organizations. This
research investigates the effectiveness of LLMs as judge models to evaluate the
thematic alignment of summaries generated by other LLMs. We utilized an
Anthropic Claude model to generate thematic summaries from open-ended survey
responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as
LLM judges. The LLM-as-judge approach was compared to human evaluations using
Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable
alternative to traditional human-centric evaluation methods. Our findings
reveal that while LLMs as judges offer a scalable solution comparable to human
raters, humans may still excel at detecting subtle, context-specific nuances.
This research contributes to the growing body of knowledge on AI-assisted text
analysis. We discuss limitations and provide recommendations for future
research, emphasizing the need for careful consideration when generalizing LLM
judge models across various contexts and use cases.
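
As a rough illustration of how agreement between a human rater and an LLM judge might be quantified, the sketch below computes Cohen's kappa and Spearman's rho in plain Python. The ratings, the 1-to-3 alignment scale, and the function names are hypothetical and are not taken from the study; Krippendorff's alpha is omitted for brevity.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(a, b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: based on each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when one rater gives a constant score
    return cov / sqrt(var_a * var_b)


# Hypothetical alignment scores (1 = poor, 3 = strong) for eight summaries.
human_scores = [1, 2, 3, 3, 2, 1, 3, 2]
judge_scores = [1, 2, 3, 2, 2, 1, 3, 3]

kappa = cohens_kappa(human_scores, judge_scores)
rho = spearman_rho(human_scores, judge_scores)
```

Kappa treats the scores as unordered categories and corrects for chance agreement, while rho only looks at rank order, so the two can diverge when a judge is consistently stricter or more lenient than the human rater.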