SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Abstract

Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. Current methods train a dedicated network end-to-end on labeled image data, which limits their generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then use LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and to bridge the gap between them. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, since LLMs can generate language-based explanations for their decisions. Manually designing prompts for the LLM reasoning phase is tedious, so an automated prompt optimization method is desirable. Because we essentially convert a visual classification task into a generative task for LLMs, automatic prompt optimization encounters a distinctive long-prompt optimization problem. To address it, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search using gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and that our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
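To make the idea of a greedy search over prompt segments concrete, the following is a minimal Python sketch, not the authors' implementation. It treats a long prompt as a list of segments and, in each round, keeps the single segment replacement that most improves a score. The abstract states that GSPO uses segment-level gradient information to guide the search; this sketch swaps that step for exhaustive evaluation of a toy keyword score, and every name in it (`greedy_segment_search`, `toy_score`, the candidate texts) is hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def greedy_segment_search(
    segments: List[str],
    candidates: Dict[int, Sequence[str]],
    score: Callable[[List[str]], float],
    max_rounds: int = 10,
) -> List[str]:
    """Replace one segment per round, keeping the swap that raises the score most."""
    best = list(segments)
    best_score = score(best)
    for _ in range(max_rounds):
        round_best, round_score = best, best_score
        # Try every candidate for every editable segment (assumption: a real
        # system would use gradient information to shortlist candidates).
        for idx, options in candidates.items():
            for option in options:
                trial = best[:idx] + [option] + best[idx + 1:]
                trial_score = score(trial)
                if trial_score > round_score:
                    round_best, round_score = trial, trial_score
        if round_score <= best_score:
            break  # no single-segment swap improves the prompt any further
        best, best_score = round_best, round_score
    return best


def toy_score(prompt: List[str]) -> float:
    """Hypothetical stand-in for validation accuracy: counts useful phrases."""
    text = " ".join(prompt).lower()
    keywords = ("step by step", "social story", "one category")
    return float(sum(k in text for k in keywords))


segments = ["You are an assistant.", "Read the text.", "Answer."]
candidates = {
    1: ["Read the social story carefully.", "Look at the input."],
    2: ["Reason step by step, then answer with one category.", "Reply briefly."],
}
result = greedy_segment_search(segments, candidates, toy_score)
print(result)
```

In practice the score would come from running the LLM on held-out examples, which is expensive for long prompts; that cost is the reason a guided, segment-level search is attractive over trying every edit.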
