Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
December 12, 2024
Authors: Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg
cs.AI
Abstract
We address the problem of gaze target estimation, which aims to predict where
a person is looking in a scene. Predicting a person's gaze target requires
reasoning both about the person's appearance and the contents of the scene.
Prior works have developed increasingly complex, hand-crafted pipelines for
gaze target estimation that carefully fuse features from separate scene
encoders, head encoders, and auxiliary models for signals like depth and pose.
Motivated by the success of general-purpose feature extractors on a variety of
visual tasks, we propose Gaze-LLE, a novel transformer framework that
streamlines gaze target estimation by leveraging features from a frozen DINOv2
encoder. We extract a single feature representation for the scene, and apply a
person-specific positional prompt to decode gaze with a lightweight module. We
demonstrate state-of-the-art performance across several gaze benchmarks and
provide extensive analysis to validate our design choices. Our code is
available at: http://github.com/fkryan/gazelle.
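The abstract describes the core architectural idea: a frozen DINOv2-style encoder produces one feature map for the whole scene, a learned positional prompt is added at the queried person's head location, and a lightweight transformer module decodes a gaze heatmap. The sketch below illustrates that design in PyTorch under stated assumptions; it is not the authors' implementation. The class `GazeLLESketch`, the stub encoder, the token layout `(B, grid*grid, feat_dim)`, and all dimensions are hypothetical choices for illustration (see the released code at the URL above for the real model).

```python
import torch
import torch.nn as nn

class GazeLLESketch(nn.Module):
    """Minimal sketch of the Gaze-LLE idea, NOT the authors' code:
    frozen scene encoder -> learned positional prompt added at the head
    location -> small transformer -> per-token gaze heatmap logits."""

    def __init__(self, backbone, feat_dim=768, d_model=256, grid=16):
        super().__init__()
        self.backbone = backbone  # frozen DINOv2-style encoder (interface assumed)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.grid = grid
        self.proj = nn.Linear(feat_dim, d_model)
        # learned prompt vector, added to every token inside the person's head region
        self.head_prompt = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=3)
        self.heatmap_head = nn.Linear(d_model, 1)

    def forward(self, image, head_mask):
        # image: (B, 3, H, W); head_mask: (B, grid, grid) binary mask of the head box
        with torch.no_grad():                 # encoder stays frozen
            tokens = self.backbone(image)     # (B, grid*grid, feat_dim), assumed layout
        x = self.proj(tokens)
        # person-specific positional prompt: (B, N, 1) * (d_model,) -> (B, N, d_model)
        x = x + head_mask.flatten(1).unsqueeze(-1) * self.head_prompt
        x = self.decoder(x)
        heat = self.heatmap_head(x).squeeze(-1)     # (B, grid*grid) logits
        return heat.view(-1, self.grid, self.grid)  # spatial gaze heatmap


# demo with a stub encoder standing in for DINOv2 (random features, fixed shape)
class StubEncoder(nn.Module):
    def forward(self, image):
        return torch.randn(image.shape[0], 16 * 16, 768)

model = GazeLLESketch(StubEncoder())
head_mask = torch.zeros(2, 16, 16)
head_mask[:, 2:5, 2:5] = 1.0  # hypothetical head location in the token grid
out = model(torch.randn(2, 3, 224, 224), head_mask)
print(out.shape)  # torch.Size([2, 16, 16])
```

Note how the head-conditioning is just an additive prompt on the shared scene tokens, which is what lets the design avoid a separate head-crop encoder: querying a different person only changes `head_mask`, not the (frozen) encoder pass.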