Abstract

Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor- and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. Our framework delivers comprehensive multi-level annotations encompassing: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization including emotion and pace. The proposed architecture processes acoustic features hierarchically across the frame, word, phoneme, note, and sentence levels. Its novel non-autoregressive local acoustic encoders enable structured hierarchical representation learning. Experimental validation confirms the framework’s superior performance across multiple evaluation dimensions compared to existing annotation approaches. Furthermore, applications in SVS training demonstrate that models trained on STARS-annotated data achieve significantly enhanced perceptual naturalness and precise style control. This work not only overcomes critical scalability challenges in the creation of singing datasets but also pioneers new methodologies for controllable singing voice synthesis.

[Figure: overall architecture of STARS]

ASA (Automatic Singing Annotation)

To assess STARS on the lyric and note alignment task, we showcase randomly selected samples. For each example, the first panel shows the mel-spectrogram with the phoneme and note segmentation results, the second panel shows the dynamic programming (DP) matrix, in which the red lines indicate phoneme correspondence, and the third panel shows the F0, the ground-truth MIDI, and the predicted MIDI.
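To make the DP matrix more concrete, the following is a minimal sketch of monotonic alignment by dynamic programming. It assumes a frame-by-phoneme score matrix (the hypothetical input `log_probs`) and is a generic Viterbi-style aligner written for illustration; it is not necessarily the procedure STARS uses.

```python
from typing import List

NEG_INF = float("-inf")


def monotonic_align(log_probs: List[List[float]]) -> List[int]:
    """Return, for each frame, the index of the phoneme it is assigned to.

    log_probs[t][p] is an assumed score of frame t belonging to phoneme p.
    The path starts at phoneme 0, ends at the last phoneme, and at each
    frame either stays on the same phoneme or advances by exactly one.
    """
    n_frames, n_phones = len(log_probs), len(log_probs[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")

    # dp[t][p]: best cumulative score of any valid path ending at (t, p).
    dp = [[NEG_INF] * n_phones for _ in range(n_frames)]
    # advanced[t][p]: True if the best path reached (t, p) from phoneme p - 1.
    advanced = [[False] * n_phones for _ in range(n_frames)]
    dp[0][0] = log_probs[0][0]

    for t in range(1, n_frames):
        for p in range(min(t + 1, n_phones)):
            stay = dp[t - 1][p]
            move = dp[t - 1][p - 1] if p > 0 else NEG_INF
            advanced[t][p] = move > stay
            dp[t][p] = max(stay, move) + log_probs[t][p]

    # Backtrack from the last phoneme at the last frame.
    path, p = [], n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        path.append(p)
        if advanced[t][p]:
            p -= 1
    return path[::-1]
```

In a plot of such a matrix, the backtracked path corresponds to a staircase running from the top-left to the bottom-right corner, and each step marks a phoneme boundary.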

Example 1

Word: 也许下个冬天 <AP> 也许还十年

Phoneme: ie x v x ia g e d ong t ian <AP> ie x v h ai sh i n ian

Mel-spectrogram and DP matrix

Note Transcription

Audio

Example 2

Word: 一次就好 <AP> 我带你去看天荒地老

Phoneme: i c i j iou h ao <AP> uo d ai n i q v k an t ian h uang d i l ao

Mel-spectrogram and DP matrix

Note Transcription

Audio

Example 3

Word: my head’s under water but <AP> i’m breathing fine <AP>

Phoneme: M AY1 HH EH1 D Z AH1 N D ER0 W AA1 T ER0 B AH1 T <AP> AY1 M B R IY1 DH IH0 NG IH1 N F AY1 N <AP>

Mel-spectrogram and DP matrix

Note Transcription

Audio

SVS (Singing Voice Synthesis)

We introduce a method that integrates global style and phoneme-level technique embeddings into the Singing Voice Synthesis (SVS) model to enable style control. In the comparisons below, Real refers to training with the ground-truth labels from the dataset, Pred refers to training with our model's annotations, and Mix refers to training with a combination of real data and our model-annotated data. During inference, we use the real dataset annotations in all three settings.
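This page does not spell out how the two kinds of embeddings enter the SVS model. As one plausible reading, the sketch below uses simple additive conditioning: each phoneme embedding is summed with the embeddings of its techniques and with the global style embeddings. The table contents, the embedding size `DIM`, and the function names are all hypothetical, and the fixed random vectors stand in for learned weights.

```python
import random
from typing import Dict, List

DIM = 4  # toy embedding size, chosen only for illustration


def make_table(keys: List[str], seed: int) -> Dict[str, List[float]]:
    """Build a toy embedding table of fixed random vectors (stand-in for learned weights)."""
    rng = random.Random(seed)
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for key in keys}


def add(*vectors: List[float]) -> List[float]:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def condition(
    phonemes: List[str],
    techniques: List[List[str]],
    style: List[str],
    phone_tab: Dict[str, List[float]],
    tech_tab: Dict[str, List[float]],
    style_tab: Dict[str, List[float]],
) -> List[List[float]]:
    """Per-phoneme input = phoneme embedding + its technique embeddings + global style."""
    # The global style vector is shared by every phoneme in the sentence.
    style_vec = add(*(style_tab[s] for s in style))
    conditioned = []
    for phone, techs in zip(phonemes, techniques):
        # A phoneme may carry zero, one, or several techniques.
        tech_vec = add(*(tech_tab[t] for t in techs)) if techs else [0.0] * DIM
        conditioned.append(add(phone_tab[phone], tech_vec, style_vec))
    return conditioned
```

Summation keeps the sequence length unchanged, so the same conditioning works whether the labels come from the dataset (Real) or from an annotator (Pred).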

Global Style Control

For global styles, we specify the following attributes for each test target:

Range: low, medium, high
Pace: slow, moderate, fast
Emotion: happy, sad
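The three attributes above can be captured in a small validated record. The sketch below is illustrative only: the class name `GlobalStyle`, the field name `pitch_range`, and the integer ids returned by `to_ids` are assumptions, while the allowed values are copied from the list above.

```python
from dataclasses import dataclass
from typing import Tuple

# Allowed values as listed on this page.
RANGES = ("low", "medium", "high")
PACES = ("slow", "moderate", "fast")
EMOTIONS = ("happy", "sad")


@dataclass(frozen=True)
class GlobalStyle:
    """Hypothetical container for the global style of one test target."""

    pitch_range: str
    pace: str
    emotion: str

    def __post_init__(self) -> None:
        # Reject any value that is not in the lists above.
        for name, value, allowed in (
            ("pitch_range", self.pitch_range, RANGES),
            ("pace", self.pace, PACES),
            ("emotion", self.emotion, EMOTIONS),
        ):
            if value not in allowed:
                raise ValueError(f"{name} must be one of {allowed}, got {value!r}")

    @classmethod
    def parse(cls, text: str) -> "GlobalStyle":
        """Parse a 'range, pace, emotion' string such as 'high, moderate, sad'."""
        pitch_range, pace, emotion = (part.strip() for part in text.split(","))
        return cls(pitch_range, pace, emotion)

    def to_ids(self) -> Tuple[int, int, int]:
        """Map each attribute to its position in the lists, e.g. for an embedding lookup."""
        return (
            RANGES.index(self.pitch_range),
            PACES.index(self.pace),
            EMOTIONS.index(self.emotion),
        )
```

For instance, `GlobalStyle.parse("high, moderate, sad").to_ids()` evaluates to `(2, 1, 1)` under the orderings defined here.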

Phoneme-Level Technique Control

For phoneme-level styles, we assign one of the following techniques to each phoneme in the target content: mixed, falsetto, breathy, pharyngeal, vibrato, glissando, weak, strong, bubble.

Example 1

Word: <SP> 不再看天上太阳透过云彩的光

Global Style (range, pace, emotion): high, moderate, sad

Phoneme with Technique: <SP>(0), b(2), u(2), z(2), ai(2), k(2), an(2,6), an(2,6), t(2), ian(2), sh(2), ang(2), t(2), ai(2), iang(2), t(2), ou(2), g(2), uo(2,6), uo(2,6), vn(2), c(2), ai(2), d(2), e(2,6), e(2,6), g(2), uang(2)

(0: no technique, 1: mixed, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando, 7: weak, 8: strong, 9: bubble)
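Annotation lines in the "Phoneme with Technique" format can be decoded mechanically with the legend above. The parser below is a small sketch for readers who want to work with these lines; the function name and the regular expression are assumptions about the format as it appears on this page, not part of any released STARS tooling.

```python
import re
from typing import List, Tuple

# Legend copied from this page.
TECHNIQUES = {
    0: "no technique", 1: "mixed", 2: "falsetto", 3: "breathy", 4: "pharyngeal",
    5: "vibrato", 6: "glissando", 7: "weak", 8: "strong", 9: "bubble",
}

# One item looks like "an(2,6)" or "<SP>(0)": a token followed by ids in parentheses.
ITEM = re.compile(r"([^\s(),]+)\(([\d,\s]+)\)")


def parse_phoneme_techniques(line: str) -> List[Tuple[str, List[str]]]:
    """Turn 'b(2), an(2,6)' into [('b', ['falsetto']), ('an', ['falsetto', 'glissando'])]."""
    parsed = []
    for token, ids in ITEM.findall(line):
        # A phoneme can carry several technique ids, separated by commas.
        names = [TECHNIQUES[int(i)] for i in ids.split(",")]
        parsed.append((token, names))
    return parsed
```

Applied to the start of the line above, `parse_phoneme_techniques("<SP>(0), b(2), an(2,6)")` yields the silence token with no technique, `b` with falsetto, and `an` with both falsetto and glissando.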

Ground Truth

Model Training Comparisons

Real Train	Mix Train	Pred Train

Example 2

Word: 在阳光灿烂的日子里开怀大笑

Global Style (range, pace, emotion): medium, fast, happy

Phoneme with Technique: z(8), ai(8), iang(6,8), iang(6,8), g(8), uang(8), c(8), an(6,8), an(6,8), l(8), an(8), d(8), e(8), r(8), i(8), z(8), i(8), l(8), i(8), k(8), ai(8), h(8), uai(8), d(8), a(6,8), a(6,8), x(8), iao(8)

(0: no technique, 1: mixed, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando, 7: weak, 8: strong, 9: bubble)

Ground Truth

Model Training Comparisons

Real Train	Mix Train	Pred Train

Example 3

Word: <SP> 远处蔚蓝天空下涌动着 <AP> 金色的麦浪

Global Style (range, pace, emotion): low, slow, happy

Phoneme with Technique: <SP>(0), van(3), ch(3), u(3), uei(3), l(3), an(3), t(3), ian(3), k(3), ong(3), x(3), ia(3), iong(3), d(3), ong(3), zh(3), e(3), <AP>(0), j(3), in(3), s(3), e(3), d(3), e(3), m(3), ai(3), l(3), ang(3)

(0: no technique, 1: mixed, 2: falsetto, 3: breathy, 4: pharyngeal, 5: vibrato, 6: glissando, 7: weak, 8: strong, 9: bubble)

Ground Truth

Model Training Comparisons

Real Train	Mix Train	Pred Train

STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation

This is the implementation of the demo page of STARS.
