SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Abstract

Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. Current methods train a dedicated network end-to-end on labeled image data, which limits their generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for their decisions. However, manually designing prompts for LLMs at the reasoning phase is tedious, so an automated prompt optimization method is desirable. Because we essentially convert a visual classification task into a generative task for LLMs, automatic prompt optimization faces a distinctive long-prompt optimization problem. To address this problem, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search using gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
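
To make the idea of a greedy search over prompt segments more concrete, here is a minimal Python sketch. It is not the authors' GSPO implementation: the abstract states that GSPO uses segment-level gradient information, whereas this sketch swaps that step for exhaustive scoring of a small, hand-written candidate pool. The function names (`greedy_segment_search`, `toy_score`), the example segments, and the keyword-based score are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


def greedy_segment_search(
    segments: List[str],
    candidates: Dict[int, List[str]],
    score: Callable[[str], float],
    max_rounds: int = 3,
) -> List[str]:
    """Replace one segment at a time, keeping a change only if the score improves."""
    best = list(segments)
    best_score = score("\n".join(best))
    for _ in range(max_rounds):
        improved = False
        for idx, options in candidates.items():
            for option in options:
                # Try swapping a single segment while holding the others fixed.
                trial = best[:idx] + [option] + best[idx + 1:]
                trial_score = score("\n".join(trial))
                if trial_score > best_score:
                    best, best_score, improved = trial, trial_score, True
        if not improved:
            break  # No single-segment change helps any more.
    return best


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for a validation metric: counts a few keywords."""
    keywords = ("step by step", "social story", "one category")
    return float(sum(k in prompt.lower() for k in keywords))


# Hypothetical long prompt, split into segments.
segments = [
    "You are an expert in social relations.",
    "Read the description.",
    "Answer the question.",
]
# Hypothetical candidate rewrites, keyed by segment index.
candidates = {
    1: ["Read the social story carefully.", "Look at the text."],
    2: ["Reason step by step, then answer with one category.", "Give an answer."],
}

optimized = greedy_segment_search(segments, candidates, toy_score)
print("\n".join(optimized))
```

In a realistic setting the score would presumably be measured on held-out labeled examples, and the candidate pool would be proposed or ranked using gradient information rather than enumerated by hand; both of those parts are left out of this sketch.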
