

EgoLife: Towards Egocentric Life Assistant

March 5, 2025
作者: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
cs.AI

Abstract

We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.
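The abstract describes EgoRAG as a retrieval-based component that answers ultra-long-context questions by first locating the relevant moments in a week of recordings rather than feeding the whole stream to a model. As a minimal sketch of that retrieval step (this is an illustrative toy, not the actual EgoRAG implementation: the log entries, timestamps, and bag-of-words scoring are all assumptions), one can rank time-stamped clip captions by similarity to a query and keep only the top matches:

```python
from collections import Counter
import math

def tokenize(text):
    return [w.lower().strip(".,!?") for w in text.split()]

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(segments, query, k=2):
    """Rank (timestamp, caption) segments by similarity to the query
    and return the top-k matches."""
    q = Counter(tokenize(query))
    scored = [(cosine(Counter(tokenize(text)), q), ts, text)
              for ts, text in segments]
    scored.sort(reverse=True)
    return [(ts, text) for _, ts, text in scored[:k]]

# Hypothetical week-long log of per-clip captions (as EgoGPT might produce).
log = [
    ("day1 09:00", "Alice chops onions and garlic in the kitchen"),
    ("day2 14:30", "the group goes grocery shopping at the market"),
    ("day3 19:00", "board game night in the living room"),
    ("day5 08:15", "Alice drinks coffee and reads the news"),
]

print(retrieve(log, "When did we go grocery shopping?", k=1))
```

Only the retrieved segments would then be passed to a language model to phrase the final answer, which is what makes question answering over hundreds of hours of footage tractable. A real system would replace the bag-of-words scores with learned multimodal embeddings.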
