EvMic: Event-based Non-contact sound recovery from effective spatial-temporal modeling
April 3, 2025
Authors: Hao Yin, Shi Guo, Xu Jia, Xudong Xu, Lu Zhang, Si Liu, Dong Wang, Huchuan Lu, Tianfan Xue
cs.AI
Abstract
When sound waves hit an object, they induce vibrations that produce
high-frequency and subtle visual changes, which can be used for recovering the
sound. Early studies encounter trade-offs among sampling rate, bandwidth,
field of view, and the simplicity of the optical path. Recent advances in
event camera hardware show strong potential for visual sound recovery,
because of event cameras' superior ability to capture
high-frequency signals. However, existing event-based vibration recovery
methods are still sub-optimal for sound recovery. In this work, we propose a
novel pipeline for non-contact sound recovery, fully utilizing spatial-temporal
information from the event stream. We first generate a large training set using
a novel simulation pipeline. We then design a network that leverages the
sparsity of events to capture spatial information and uses Mamba to model
long-term temporal information. Lastly, we train a spatial aggregation block to
aggregate information from different locations to further improve signal
quality. To capture event signals caused by sound waves, we also design an
imaging system using a laser matrix to enhance the gradients and collect
multiple data sequences for testing. Experimental results on synthetic and
real-world data demonstrate the effectiveness of our method.
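To make the described pipeline more concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract outlines: a spatial encoder over event voxel grids, a Mamba block for long-term temporal modeling, and an attention-style spatial aggregation that pools per-location signals into one waveform. This is not the authors' implementation; the dense convolutional encoder stands in for their sparsity-aware spatial module, the `mamba_ssm` package is an assumed dependency, and all module names and hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency (pip install mamba-ssm)


class EvMicSketch(nn.Module):
    """Illustrative sketch, not the paper's actual network."""

    def __init__(self, in_ch=2, feat_dim=64):
        super().__init__()
        # Spatial encoder: a dense stand-in for the sparsity-aware,
        # event-driven feature extractor described in the abstract.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, feat_dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Long-term temporal model, applied per spatial location.
        self.temporal = Mamba(d_model=feat_dim)
        # Spatial aggregation: learned attention weights over locations,
        # so the final waveform pools the most informative vibrations.
        self.score = nn.Linear(feat_dim, 1)
        self.head = nn.Linear(feat_dim, 1)  # one audio sample per frame

    def forward(self, voxels):                       # (B, T, C, H, W)
        B, T, C, H, W = voxels.shape
        f = self.spatial(voxels.flatten(0, 1))       # (B*T, D, H', W')
        D, Hp, Wp = f.shape[1:]
        # One temporal sequence per spatial location: (B*N, T, D).
        f = f.view(B, T, D, Hp * Wp).permute(0, 3, 1, 2).flatten(0, 1)
        f = self.temporal(f)                         # Mamba over time
        f = f.view(B, Hp * Wp, T, D)
        w = torch.softmax(self.score(f).mean(dim=2), dim=1)  # (B, N, 1)
        audio = (self.head(f) * w.unsqueeze(2)).sum(dim=1)   # (B, T, 1)
        return audio.squeeze(-1)                     # (B, T) waveform


# Usage: 2-channel (polarity) event voxel grid with 1024 temporal bins.
# model = EvMicSketch().cuda()
# wave = model(torch.randn(1, 1024, 2, 64, 64, device="cuda"))
```

Note that in such a design the per-frame outputs are read out at the temporal binning rate of the voxel grid, so the recoverable audio bandwidth is set by how finely the event stream is binned in time.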