Token-Budget-Aware LLM Reasoning

Abstract

Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework, which dynamically estimates token budgets for different problems based on reasoning complexity and uses the estimated token budgets to guide the reasoning process. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: https://github.com/GeniusHTX/TALE.
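
The idea of placing an estimated token budget in the prompt can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `estimate_budget` is a hypothetical word-and-number-count heuristic standing in for a complexity-based estimator, and the instruction wording in `build_budget_prompt` is assumed rather than taken from the abstract. Neither is the TALE implementation; the linked repository holds the authors' code.

```python
import re


def estimate_budget(question: str, base: int = 20, per_number: int = 15,
                    per_word: float = 0.5, cap: int = 300) -> int:
    """Hypothetical stand-in for a complexity-based budget estimator.

    Uses word count and the number of numeric literals as a crude proxy
    for reasoning complexity. This is NOT the estimator from the paper.
    """
    words = len(question.split())
    numbers = len(re.findall(r"\d+(?:\.\d+)?", question))
    budget = base + per_number * numbers + int(per_word * words)
    # Clamp so that very long questions do not receive an unbounded budget.
    return min(budget, cap)


def build_budget_prompt(question: str, budget: int) -> str:
    """Append a CoT instruction that states the token budget (assumed wording)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (
        f"{question}\n"
        f"Let's think step by step and use less than {budget} tokens."
    )


if __name__ == "__main__":
    q = "A train travels 60 km in 1.5 hours. What is its average speed?"
    print(build_budget_prompt(q, estimate_budget(q)))
```

The resulting string would be sent to the model in place of a plain CoT prompt. Because the abstract notes that the choice of budget strongly affects how much the output is actually compressed, the estimator is the part that would need the most care in practice.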
