CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
March 13, 2025
Authors: Advait Gupta, NandaKiran Velaga, Dang Nguyen, Tianyi Zhou
cs.AI
Abstract
Text-to-image models such as Stable Diffusion and DALLE-3 still struggle with
multi-turn image editing. We decompose such a task into an agentic workflow
(path) of tool use that addresses a sequence of subtasks with AI tools of varying
costs. Conventional search algorithms require expensive exploration to find
tool paths. While large language models (LLMs) possess prior knowledge of
subtask planning, they may lack the accurate estimates of tool capabilities and
costs needed to decide which tool to apply to each subtask. Can we combine the
strengths of both LLMs and graph search to find cost-efficient tool paths? We
propose a three-stage approach "CoSTA*" that leverages LLMs to create a subtask
tree, which helps prune a graph of AI tools for the given task, and then
conducts A* search on the small subgraph to find a tool path. To better balance
the total cost and quality, CoSTA* combines the cost and quality of each tool on
every subtask to guide the A* search. Each subtask's output is then evaluated by
a vision-language model (VLM), and a failure triggers an update of the
tool's cost and quality on that subtask. Hence, the A* search can recover from
failures quickly to explore other paths. Moreover, CoSTA* can automatically
switch between modalities across subtasks for a better cost-quality trade-off.
We build a novel, challenging benchmark for multi-turn image editing, on which
CoSTA* outperforms state-of-the-art image-editing models and agents in terms of
both cost and quality, and supports versatile trade-offs according to user preference.
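
The search-and-recover loop described in the abstract can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tool names, the cost and quality numbers, the linear scalarization in `edge_weight`, the `alpha` trade-off parameter, the cost-doubling and quality-halving update on failure, and the `check` callback that stands in for the VLM evaluation. The subtask tree is also simplified to a linear chain.

```python
import heapq

# Hypothetical tool table: subtask -> {tool name: (cost, quality in [0, 1])}.
TOOLS = {
    "detect": {"yolo": (1.0, 0.90), "grounding_dino": (3.0, 0.95)},
    "remove": {"inpaint_small": (2.0, 0.60), "inpaint_large": (6.0, 0.92)},
}

def edge_weight(cost, quality, alpha):
    # Assumed scalarization: alpha weights cost, (1 - alpha) weights quality loss.
    return alpha * cost + (1.0 - alpha) * (1.0 - quality)

def a_star(subtasks, tools, alpha):
    """Return (total weight, tool path) for a linear chain of subtasks."""
    # Admissible heuristic: sum of the cheapest edge of each remaining subtask.
    best = [min(edge_weight(c, q, alpha) for c, q in tools[s].values())
            for s in subtasks]
    h = [sum(best[i:]) for i in range(len(subtasks) + 1)]
    frontier = [(h[0], 0.0, 0, ())]  # (f, g, next subtask index, path so far)
    while frontier:
        _, g, i, path = heapq.heappop(frontier)
        if i == len(subtasks):
            return g, list(path)
        for name, (c, q) in tools[subtasks[i]].items():
            g2 = g + edge_weight(c, q, alpha)
            heapq.heappush(frontier, (g2 + h[i + 1], g2, i + 1, path + (name,)))
    return float("inf"), []

def run_with_feedback(subtasks, tools, alpha, check, max_attempts=10):
    """Execute subtasks in order, re-planning whenever `check` rejects an output."""
    tools = {s: dict(t) for s, t in tools.items()}  # work on a local copy
    done, i, attempts = [], 0, 0
    while i < len(subtasks):
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("no tool passed the check within max_attempts")
        _, plan = a_star(subtasks[i:], tools, alpha)
        tool = plan[0]
        if check(subtasks[i], tool):  # stand-in for the VLM evaluation
            done.append(tool)
            i += 1
        else:
            # Assumed update rule: double the cost and halve the quality.
            c, q = tools[subtasks[i]][tool]
            tools[subtasks[i]][tool] = (c * 2.0, q * 0.5)
    return done
```

With `alpha=0.5` the sketch prefers the cheaper tools, while `alpha=0.0` ignores cost and picks the highest-quality tool for each subtask. If `check` keeps rejecting one tool, its updated cost and quality eventually make the search switch to an alternative for that subtask.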