EgoNormia: Benchmarking Physical Social Norm Understanding

Abstract

Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms but also weigh the trade-offs between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\|\epsilon\|$, a dataset of 1,853 ego-centric videos of human interactions, each paired with two related questions that evaluate both the prediction and the justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline that combines video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human baseline of 92%). Our analysis of performance in each dimension highlights significant risks to safety and privacy, as well as a lack of collaboration and communication capability, when such models are applied to real-world agents. We additionally show that, through a retrieval-based generation method, EgoNormia can be used to enhance normative reasoning in VLMs.
