Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance

February 17, 2025
作者: Birger Moell, Johan Boye
cs.AI

Abstract

Large Language Models (LLMs) have made significant strides in natural language generation but often face challenges in tasks requiring precise calculations and structural analysis. This paper investigates the performance of state-of-the-art LLMs on language complexity measurement tasks through the computation of the LIX readability metric and Average Dependency Distance (ADD). Using Swedish high school and university-level essays, we evaluate the models' ability to compute LIX scores and perform dependency parsing, comparing their results to established ground truths. Our findings reveal that while all models demonstrate some capacity for these tasks, ChatGPT-o1-mini performs most consistently, achieving the highest accuracy in both LIX computation and dependency parsing. Additionally, we observe a strong, significant negative correlation (r = -0.875, p = 0.026, N = 6) between the models' accuracy in computing LIX and their overall performance on the Massive Multitask Language Understanding (MMLU) benchmark. These results suggest that language complexity measurement abilities can serve as a noisy zero-shot proxy for assessing the general capabilities of LLMs, providing a practical method for model evaluation without the need for extensive benchmarking datasets.
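For reference, the two measures themselves are simple to compute: LIX is the mean sentence length in words plus 100 times the share of words longer than six characters, and ADD is the mean linear distance between each token and its syntactic head. Below is a minimal Python sketch of both (not the authors' code), assuming a regex tokenizer and a pre-computed list of 1-based dependency heads with 0 marking the root:

```python
import re

def lix(text: str) -> float:
    """LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than 6 characters. Assumes non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)  # Unicode-aware, so å/ä/ö count
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

def avg_dependency_distance(heads: list[int]) -> float:
    """ADD over one parsed sentence: heads[i] is the 1-based index of the
    head of token i+1; the root (head 0) is skipped by convention."""
    distances = [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]
    return sum(distances) / len(distances)

if __name__ == "__main__":
    # Illustrative Swedish sample and a hand-made parse, not data from the paper.
    sample = "Stora språkmodeller kan analysera meningar. Resultaten varierar."
    print(f"LIX: {lix(sample):.1f}")
    # Heads for "Stora språkmodeller kan analysera meningar":
    # Stora->språkmodeller(2), språkmodeller->analysera(4),
    # kan->analysera(4), analysera->ROOT(0), meningar->analysera(4)
    print(f"ADD: {avg_dependency_distance([2, 4, 4, 0, 4]):.2f}")
```

The paper's evaluation asks whether an LLM can reproduce such scores directly; the sketch above only shows what the ground-truth computation looks like.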
