Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

December 12, 2024
Authors: Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg
cs.AI

Abstract
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle.
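The key idea in the abstract is that one frozen scene representation can be reused for every person in the image, with gaze decoded per person via a positional prompt placed at that person's head location. The following minimal NumPy sketch illustrates that mechanism only; the shapes, names, and the `prompt_person` helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch (assumed shapes/names, not the Gaze-LLE code):
# a frozen encoder such as DINOv2 produces one grid of scene tokens;
# to decode gaze for a specific person, a single learned "head position"
# embedding is added to the tokens covering that person's head, and a
# lightweight decoder would then run over the prompted tokens.

rng = np.random.default_rng(0)
H = W = 16   # token grid size from the frozen encoder (assumed)
D = 32       # token dimensionality (assumed)

scene_tokens = rng.standard_normal((H, W, D))  # one shared scene representation
head_prompt = rng.standard_normal(D)           # learned positional prompt vector

def prompt_person(tokens, head_bbox, prompt):
    """Add the head-position prompt to the tokens inside the head bounding box.

    head_bbox is (row_start, row_end, col_start, col_end) in token coordinates.
    """
    y0, y1, x0, x1 = head_bbox
    prompted = tokens.copy()
    prompted[y0:y1, x0:x1] += prompt  # person-specific conditioning
    return prompted

# Two people in the same scene reuse the same scene tokens; only the
# prompt location differs, so the expensive encoder runs once per image.
person_a = prompt_person(scene_tokens, (2, 4, 3, 5), head_prompt)
person_b = prompt_person(scene_tokens, (10, 12, 8, 10), head_prompt)
```

This mirrors the abstract's claim that the hand-crafted multi-encoder fusion of prior pipelines is replaced by a single frozen feature map plus a cheap, per-person prompt.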
