Sharingan: Extract User Action Sequence from Desktop Recordings
November 13, 2024
Authors: Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI
Abstract
Video recordings of user activities, particularly desktop recordings, offer a
rich source of data for understanding user behaviors and automating processes.
However, despite advancements in Vision-Language Models (VLMs) and their
increasing use in video analysis, extracting user actions from desktop
recordings remains an underexplored area. This paper addresses this gap by
proposing two novel VLM-based methods for user action extraction: the Direct
Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and
the Differential Frame-Based Approach (DiffF), which incorporates explicit
frame differences detected via computer vision techniques. We evaluate these
methods using a basic self-curated dataset and an advanced benchmark adapted
from prior work. Our results show that the DF approach achieves an accuracy of
70% to 80% in identifying user actions, with the extracted action sequences
being re-playable through Robotic Process Automation. We find that while VLMs
show potential, incorporating explicit UI changes can degrade performance,
making the DF approach more reliable. This work represents the first
application of VLMs for extracting user action sequences from desktop
recordings, contributing new methods, benchmarks, and insights for future
research.
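To make the two approaches concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of the preprocessing they rest on: DF uniformly samples frames from the recording, while DiffF additionally computes explicit frame differences with standard computer vision operations. It assumes OpenCV and NumPy are available; the file name `recording.mp4`, the sampling interval, and the change threshold are illustrative choices, not values from the paper.

```python
# Illustrative sketch only: frame sampling (as in DF) plus explicit frame
# differencing (as in DiffF). All parameters and names are assumptions.
import cv2
import numpy as np

def sample_frames(video_path: str, interval_sec: float = 1.0):
    """Sample one frame per `interval_sec` from a desktop recording."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreported
    step = max(1, int(round(fps * interval_sec)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def frame_difference(prev: np.ndarray, curr: np.ndarray, thresh: int = 25):
    """Return a binary mask of pixels that changed between two frames."""
    diff = cv2.absdiff(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask

frames = sample_frames("recording.mp4")
for prev, curr in zip(frames, frames[1:]):
    mask = frame_difference(prev, curr)
    changed_ratio = float((mask > 0).mean())
    # A large changed region suggests a UI transition (e.g., a click or
    # window switch) occurred between the two sampled frames.
    print(f"changed pixel ratio: {changed_ratio:.3f}")
```

Under this reading, DF would pass the sampled frames directly to a VLM, whereas DiffF would also supply the detected change regions; the paper's finding that such explicit UI-change signals can degrade performance is what makes DF the more reliable of the two.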