SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Abstract

Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/, and the code will be available at https://github.com/LiuZH-19/SongGen.
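The abstract does not spell out the token patterns it explores. As a purely illustrative sketch of how a single auto-regressive sequence could carry two separate tracks, the Python below interleaves vocal and accompaniment tokens frame by frame and then splits them again. The function names, the integer token values, and the frame-by-frame interleaving order are assumptions made for illustration, not the authors' confirmed design.

```python
from typing import List, Tuple


def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Merge two equal-length token tracks into one sequence: v0, a0, v1, a1, ...

    Hypothetical helper: the ordering (vocal first) is an assumption.
    """
    if len(vocal) != len(accomp):
        raise ValueError("tracks must have the same number of frames")
    merged: List[int] = []
    for v, a in zip(vocal, accomp):
        merged.extend((v, a))
    return merged


def split_tracks(merged: List[int]) -> Tuple[List[int], List[int]]:
    """Invert interleave_tracks: even positions are vocal, odd are accompaniment."""
    if len(merged) % 2 != 0:
        raise ValueError("interleaved sequence must have even length")
    return merged[0::2], merged[1::2]


if __name__ == "__main__":
    vocal_tokens = [1, 2, 3]       # made-up token ids
    accomp_tokens = [10, 20, 30]   # made-up token ids
    sequence = interleave_tracks(vocal_tokens, accomp_tokens)
    print(sequence)
    print(split_tracks(sequence))
```

Under this assumed layout, one decoder can emit a single stream while downstream tools still recover each track separately, which is the kind of flexibility the dual-track mode is described as providing.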
