F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Abstract

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT). Without requiring complex designs such as a duration model, a text encoder, or phoneme alignment, the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed to generate speech, an approach first shown to be feasible by E2 TTS. However, the original design of E2 TTS is hard to follow because of its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easier to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow steps can be easily applied to existing flow-matching-based models without retraining. Our design allows faster training and achieves an inference real-time factor (RTF) of 0.15, a substantial improvement over state-of-the-art diffusion-based TTS models. Trained on a public 100K-hour multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and efficient speed control. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.
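Two ideas in the abstract can be made concrete with a small sketch: padding the text with filler tokens to the length of the speech input, and reading an RTF value. This is an illustration only, not the released F5-TTS code. The filler symbol `<F>`, the character-level tokenization, and both function names are assumptions made here. The RTF helper uses the standard definition (generation time divided by the duration of the generated audio).

```python
# Hypothetical filler symbol; the abstract does not name the actual token.
FILLER = "<F>"


def pad_text_to_speech_length(text_tokens, num_speech_frames, filler=FILLER):
    """Append filler tokens so the text sequence has one entry per speech frame."""
    if len(text_tokens) > num_speech_frames:
        raise ValueError("text sequence is longer than the speech frame sequence")
    padding = [filler] * (num_speech_frames - len(text_tokens))
    return list(text_tokens) + padding


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Character-level tokens are an assumption made for this example.
tokens = list("hi there")
padded = pad_text_to_speech_length(tokens, num_speech_frames=12)
print(padded)

# An RTF of 0.15 means 10 seconds of audio take about 1.5 seconds to generate.
print(real_time_factor(1.5, 10.0))
```

An RTF below 1 means synthesis runs faster than real time, so a lower value is better.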
