Multi-subject Open-set Personalization in Video Generation

Abstract

Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and backgrounds, which eliminates the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, because paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
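The fusion step described in the abstract can be illustrated with a small NumPy sketch. This is only one plausible reading, not the authors' implementation: the binding scheme (concatenating a subject's text embedding to each of its image tokens and applying a linear projection), the function names `bind_subject` and `cross_attend`, and all shapes and weights are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bind_subject(image_tokens, word_embedding, w_bind):
    # Hypothetical binding: pair every image token of a subject with that
    # subject's text embedding, then project back to the model width.
    # image_tokens: (n, d), word_embedding: (d,), w_bind: (2d, d)
    n = image_tokens.shape[0]
    tiled = np.tile(word_embedding, (n, 1))                # (n, d)
    return np.concatenate([image_tokens, tiled], axis=1) @ w_bind

def cross_attend(video_tokens, cond_tokens, w_q, w_k, w_v):
    # Standard scaled dot-product cross-attention: video tokens are the
    # queries, subject conditioning tokens supply keys and values.
    q = video_tokens @ w_q
    k = cond_tokens @ w_k
    v = cond_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return video_tokens + attn @ v, attn                   # residual update

rng = np.random.default_rng(0)
d = 8  # toy embedding width (assumption)

# Two subjects, each with reference-image tokens and one text embedding.
subjects = [
    (rng.normal(size=(4, d)), rng.normal(size=d)),
    (rng.normal(size=(6, d)), rng.normal(size=d)),
]
w_bind = rng.normal(size=(2 * d, d)) * 0.1
cond = np.concatenate(
    [bind_subject(img, txt, w_bind) for img, txt in subjects], axis=0
)

video = rng.normal(size=(16, d))  # toy video latent tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, attn = cross_attend(video, cond, w_q, w_k, w_v)
```

Because each subject's tokens carry both appearance and text information, the number of subjects changes only the length of the conditioning sequence, which is one way a single model could handle multiple subjects without per-subject optimization.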
