

DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

April 10, 2025
作者: Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
cs.AI

Abstract

Reasoning-enabled large language models (LLMs) have recently demonstrated impressive performance in complex logical and mathematical tasks, yet their effectiveness in evaluating natural language generation remains unexplored. This study systematically compares reasoning-based LLMs (DeepSeek-R1 and OpenAI o3) with their non-reasoning counterparts across machine translation (MT) and text summarization (TS) evaluation tasks. We evaluate eight models across three architectural categories, including state-of-the-art reasoning models, their distilled variants (ranging from 8B to 70B parameters), and equivalent conventional, non-reasoning LLMs. Our experiments on the WMT23 and SummEval benchmarks reveal that the benefits of reasoning capabilities are highly model- and task-dependent: while OpenAI o3-mini models show consistent performance improvements with increased reasoning intensity, DeepSeek-R1 underperforms compared to its non-reasoning variant, with the exception of certain aspects of TS evaluation. Correlation analysis demonstrates that increased reasoning token usage positively correlates with evaluation quality in o3-mini models. Furthermore, our results show that distillation of reasoning capabilities maintains reasonable performance in medium-sized models (32B) but degrades substantially in smaller variants (8B). This work provides the first comprehensive assessment of reasoning LLMs for NLG evaluation and offers insights into their practical use.

