Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Abstract

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning about both the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle
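The abstract does not say how the person-specific positional prompt is injected into the scene features, so the following is only a minimal sketch under stated assumptions: the scene is represented as a grid of patch tokens, the person is identified by a normalized head bounding box, and the prompt is a single vector added to every token whose patch center falls inside that box. The function names `head_mask` and `apply_prompt` are hypothetical and are not taken from the released code; in a real model the tokens would come from the frozen encoder and the prompt vector would be learned.

```python
from typing import List, Tuple

Grid = List[List[float]]
Tokens = List[List[List[float]]]  # [grid_h][grid_w][dim]


def head_mask(bbox: Tuple[float, float, float, float],
              grid_h: int, grid_w: int) -> Grid:
    """Binary mask over the patch grid.

    bbox is (x0, y0, x1, y1) in normalized [0, 1] image coordinates
    (an assumed convention). A patch is marked 1.0 when its center
    lies inside the box, 0.0 otherwise.
    """
    x0, y0, x1, y1 = bbox
    mask = []
    for r in range(grid_h):
        cy = (r + 0.5) / grid_h  # normalized center of patch row r
        row = []
        for c in range(grid_w):
            cx = (c + 0.5) / grid_w  # normalized center of patch column c
            inside = x0 <= cx <= x1 and y0 <= cy <= y1
            row.append(1.0 if inside else 0.0)
        mask.append(row)
    return mask


def apply_prompt(tokens: Tokens, mask: Grid, prompt: List[float]) -> Tokens:
    """Add the prompt vector to every token selected by the mask.

    Tokens outside the mask are returned unchanged, so the same scene
    features can be reused for each person with a different mask.
    """
    return [
        [[t + m * p for t, p in zip(tok, prompt)]
         for tok, m in zip(tok_row, mask_row)]
        for tok_row, mask_row in zip(tokens, mask)
    ]


# Toy usage: a 4x4 grid of 2-dimensional zero tokens and a head box
# covering the top-left quadrant of the image.
tokens = [[[0.0, 0.0] for _ in range(4)] for _ in range(4)]
mask = head_mask((0.0, 0.0, 0.5, 0.5), grid_h=4, grid_w=4)
prompted = apply_prompt(tokens, mask, prompt=[1.0, 2.0])
```

Under these assumptions the scene features are computed once per image, and only the cheap masking step is repeated for each person, which is consistent with the abstract's description of a single scene representation followed by a lightweight decoding module.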
