On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

Abstract

We present On-device Sora, a pioneering solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. Building on Open-Sora, On-device Sora applies three novel techniques to address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) reduces the intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, addressing the challenge of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and experimental evaluations demonstrate that it can generate high-quality videos on the device, comparable to those produced by Open-Sora running on high-end GPUs. These results show that On-device Sora enables efficient, high-quality video generation on resource-constrained mobile devices, expanding accessibility, ensuring user privacy, reducing dependence on cloud infrastructure, and lowering associated costs. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices. The code implementation is publicly available in a GitHub repository: https://github.com/eai-lab/On-device-Sora.
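To make the idea of merging consecutive tokens along the temporal dimension concrete, here is a minimal illustrative sketch in plain Python. It is not the authors' implementation. The abstract does not say how tokens are merged or restored, so averaging each pair of adjacent frames and duplicating merged frames afterwards are both assumptions, and the names `merge_temporal` and `unmerge_temporal` are hypothetical.

```python
from typing import List

# One frame holds S spatial tokens; each token is a D-dimensional vector.
Frame = List[List[float]]


def merge_temporal(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames token by token, halving T.

    Assumption: pairwise averaging stands in for whatever merge rule the
    real system uses.
    """
    if len(frames) % 2 != 0:
        raise ValueError("this sketch assumes an even number of frames")
    merged = []
    for t in range(0, len(frames), 2):
        first, second = frames[t], frames[t + 1]
        merged.append([
            [(a + b) / 2.0 for a, b in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    return merged


def unmerge_temporal(merged: List[Frame]) -> List[Frame]:
    """Restore the original frame count by duplicating each merged frame.

    Assumption: duplication is only one possible way to undo the merge.
    """
    restored = []
    for frame in merged:
        restored.append([list(tok) for tok in frame])
        restored.append([list(tok) for tok in frame])
    return restored


# Toy input: 4 frames, 2 spatial tokens per frame, 2 channels per token.
frames = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0]],
    [[6.0, 6.0], [7.0, 7.0]],
]
merged = merge_temporal(frames)       # 2 frames reach the attention layer
restored = unmerge_temporal(merged)   # back to 4 frames afterwards
print(len(frames), "->", len(merged), "->", len(restored))
```

Because self-attention cost grows quadratically with the number of tokens, halving the token count along the temporal axis reduces the pairwise attention work over that axis to roughly a quarter, which is the kind of saving a temporal merge is aiming for.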
