Sharingan: Extract User Action Sequence from Desktop Recordings

November 13, 2024
Authors: Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
cs.AI

Abstract
Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being replayable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.
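The abstract does not describe the implementation, but the core idea behind DiffF, detecting explicit UI changes between consecutive frames with computer vision before passing anything to the VLM, can be illustrated with a minimal sketch. The sketch below assumes OpenCV; the sampling interval, diff threshold, and minimum region area (`every_n`, `thresh`, `min_area`) are hypothetical parameters chosen for illustration, not values from the paper.

```python
# Minimal sketch of the frame-differencing step suggested by DiffF.
# Assumes OpenCV (cv2); all parameter values are illustrative.
import cv2

def sample_frames(video_path, every_n=30):
    """Yield (index, frame) for every n-th frame of a desktop recording."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield idx, frame
        idx += 1
    cap.release()

def changed_regions(prev, curr, thresh=25, min_area=100):
    """Return bounding boxes (x, y, w, h) of regions that differ between two frames."""
    diff = cv2.absdiff(
        cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
        cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY),
    )
    # Binarize the pixel-wise difference, then keep only sizeable changed areas.
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```

In a DiffF-style pipeline, the detected change regions would be supplied to the VLM alongside the sampled frames, whereas DF feeds the sampled frames to the VLM directly; per the paper's findings, the added explicit UI-change signal can actually degrade accuracy relative to plain DF.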

