SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Abstract

Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. Current methods train a dedicated network end-to-end on labeled image data, which limits their generalizability and interpretability. To address these issues, we first present SocialGPT, a simple yet well-crafted framework that combines the perception capability of Vision Foundation Models (VFMs) with the reasoning capability of Large Language Models (LLMs) in a modular design, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then use LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and to bridge the gap between them. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for their decisions. Designing prompts for the LLMs manually at the reasoning phase is tedious, so an automated prompt optimization method is desirable. Because we essentially convert a visual classification task into a generative task for LLMs, automatic prompt optimization encounters a distinctive long-prompt optimization problem. To address it, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search using gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and that our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
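
To give a rough feel for what a greedy search over prompt segments looks like, here is a minimal sketch. It is not the authors' GSPO implementation: the abstract says GSPO uses gradient information at the segment level, whereas this sketch simply scores every candidate swap. The function `greedy_segment_search`, the candidate lists, and `toy_score` (a keyword counter standing in for accuracy of an LLM on labeled examples) are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Sequence


def greedy_segment_search(
    segments: Sequence[str],
    candidates: Dict[int, List[str]],
    score: Callable[[List[str]], float],
    max_rounds: int = 10,
) -> List[str]:
    """Swap one prompt segment per round, keeping the single best-scoring swap."""
    current = list(segments)
    best = score(current)
    for _ in range(max_rounds):
        best_swap = None
        # Try every candidate replacement for every editable segment.
        for idx, options in candidates.items():
            for option in options:
                if option == current[idx]:
                    continue
                trial = current[:idx] + [option] + current[idx + 1:]
                trial_score = score(trial)
                if trial_score > best:
                    best, best_swap = trial_score, (idx, option)
        if best_swap is None:
            break  # no single-segment swap improves the score
        idx, option = best_swap
        current[idx] = option
    return current


def toy_score(prompt_segments: List[str]) -> float:
    """Assumption: a keyword count standing in for a real evaluation of the LLM."""
    text = " ".join(prompt_segments)
    keywords = ("social story", "step by step", "one of")
    return float(sum(keyword in text for keyword in keywords))


segments = ["You are an expert.", "Read the description.", "Answer."]
candidates = {
    1: ["Read the social story carefully.", "Look at the text."],
    2: [
        "Think step by step, then answer with one of the listed relations.",
        "Reply briefly.",
    ],
}
optimized = greedy_segment_search(segments, candidates, toy_score)
print("\n".join(optimized))
```

In a real setting, `score` would run the LLM on held-out labeled examples. Any gradient-based ranking would serve to shrink the candidate set before this exhaustive inner loop, which otherwise grows expensive for long prompts.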
