DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts the two-pass architecture to decompose the generation process into two steps, where a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech based on the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder, and use FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To consider all potential paths in the DAG during training, we calculate the expected hidden states for each target token via dynamic programming, and feed them into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take hidden states on that path as input to the acoustic decoder. Experiments on the CVSS benchmark demonstrate that DASpeech can achieve comparable or even better performance than the state-of-the-art S2ST model Translatotron 2, while preserving up to 18.53× speedup compared to the autoregressive baseline model. Compared with the previous non-autoregressive S2ST model, DASpeech does not rely on knowledge distillation and iterative decoding, achieving significant improvements in both translation quality and decoding speed. Furthermore, DASpeech shows the ability to preserve the speaker’s voice of the source speech during translation.
Audio samples are available at https://ictnlp.github.io/daspeech-demo/.
Paper: https://arxiv.org/abs/2310.07403.
Code: https://github.com/ictnlp/DASpeech.
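The two dynamic-programming steps described in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not code from the released repository: the function names, the dense `trans`/`emit`/`hidden` arrays, and the use of plain probabilities instead of log-space arithmetic are all hypothetical choices made for readability. The path search here scores transitions only, so it is not the Joint-Viterbi decoding used for the results below.

```python
import numpy as np

def expected_hidden_states(trans, emit, hidden):
    """Expected decoder hidden state for each target token (training-time idea).

    Assumed inputs (hypothetical layout):
      trans:  (L, L) transition probabilities, nonzero only for u < v (a DAG).
      emit:   (M, L) emit[i, v] = probability that vertex v emits target token i.
      hidden: (L, D) hidden state of each DAG vertex.
    Paths are assumed to start at vertex 0 and end at vertex L-1,
    visiting exactly M vertices (one per target token).
    """
    M, L = emit.shape
    # Forward pass: alpha[i, v] = P(tokens 0..i generated, token i placed on v).
    alpha = np.zeros((M, L))
    alpha[0, 0] = emit[0, 0]
    for i in range(1, M):
        alpha[i] = (alpha[i - 1] @ trans) * emit[i]
    # Backward pass: beta[i, u] = P(remaining tokens, path ends at L-1 | token i on u).
    beta = np.zeros((M, L))
    beta[M - 1, L - 1] = 1.0
    for i in range(M - 2, -1, -1):
        beta[i] = trans @ (emit[i + 1] * beta[i + 1])
    # Posterior over vertices for each token, then the expectation of hidden states.
    posterior = alpha * beta / alpha[M - 1, L - 1]
    return posterior @ hidden

def most_probable_path(trans):
    """Most probable vertex path from 0 to L-1, scoring transitions only."""
    L = trans.shape[0]
    with np.errstate(divide="ignore"):  # log(0) = -inf marks missing edges
        log_t = np.log(trans)
    best = np.full(L, -np.inf)
    back = np.zeros(L, dtype=int)
    best[0] = 0.0
    for v in range(1, L):
        scores = best[:v] + log_t[:v, v]  # only earlier vertices can precede v
        back[v] = int(np.argmax(scores))
        best[v] = scores[back[v]]
    path = [L - 1]
    while path[-1] != 0:
        path.append(int(back[path[-1]]))
    return path[::-1]
```

In this sketch, the rows returned by `expected_hidden_states` play the role of the acoustic decoder input during training, while at inference the hidden states would be gathered along the selected path, for example `hidden[most_probable_path(trans)]`.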
Translation Results
DASpeech: Our model with Joint-Viterbi decoding (λ=0.5).
Translatotron: Baseline model from “Direct speech-to-speech translation with a sequence-to-sequence model”.
Translatotron 2: Baseline model from “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation”.
S2UT: Baseline model from “Direct Speech-to-Speech Translation With Discrete Units”.
UnitY: Baseline model from “UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units”.
TranSpeech: Baseline model from “TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation”.
CVSS-C Fr-En
S2ST results on CVSS-C Fr→En test sets. “Sample id” denotes the id of the source audio. “ASR” denotes the speech recognition result of the corresponding audio.
Sample id | Source text | Target text (ground truth) | ASR: DASpeech | ASR: Translatotron | ASR: Translatotron 2 | ASR: S2UT | ASR: UnitY | ASR: TranSpeech
---|---|---|---|---|---|---|---|---
common_voice_fr_17960551 | Non, ce n’est qu’une interruption… | no it's only an interruption | no it is only an interruption | no that's ays an interruption | no this is just an interruption | no this is an interection | no this is only an interruption | this is what is interruption
common_voice_fr_19601543 | Il indique la valeur de la pente. | it indicates the value of the slope | it indicates the value of the slope | he indicates the valley of the ten | he indicates the value of the weight | he indicates the value of the store | he shows the value of the door | he indicates the slow value
common_voice_fr_17740817 | Puisqu’il ne pleuvait pas nous n’étions pas allés au cinéma. | since it wasn't raining we didn't go to the movies | since it wasn't raining we didn't go to the movies | since it was not crane we did not go to the movies | since he was not raining we didn't go to the cinema | since it was not raining we didn't go to the cinema | since it was not raining it didn't go to the movies | since it wasn't raining we won't go to the movies
CVSS-T Fr-En
S2ST results on CVSS-T Fr→En test sets.
Sample id | Source text | Target text (ground truth) | ASR: DASpeech | ASR: Translatotron | ASR: Translatotron 2 | ASR: S2UT | ASR: UnitY | ASR: TranSpeech
---|---|---|---|---|---|---|---|---
common_voice_fr_17960551 | Non, ce n’est qu’une interruption… | no it's only an interruption | no it is only an interruption | no it is a indirection | no this is only one interruption | no it is only a stranger | no this is only one interruption | so this is interruption
common_voice_fr_19601543 | Il indique la valeur de la pente. | it indicates the value of the slope | it indicates the value of the slope | i gosis value of io | it indicates the value of the door | he indicates the value of the punt | he indicates the value of the pont | he indicates the value of the scup
common_voice_fr_17740817 | Puisqu’il ne pleuvait pas nous n’étions pas allés au cinéma. | since it wasn't raining we didn't go to the movies | since it wasn't raining we didn't want to the movies | since this was not rain hande ito | since it was not raining we didn't go to the movies | since it wasn't raining we did not go to the movies | since it was not raining we didn't go to the movies | since it war not raining we won't go to the movies
CVSS-C X-En
Multilingual X→En S2ST results on CVSS-C test sets. “De” denotes German. “Es” denotes Spanish. “It” denotes Italian.
Sample id | Source text | Target text (ground truth) | ASR: DASpeech | ASR: Translatotron 2 | ASR: S2UT | ASR: UnitY
---|---|---|---|---|---|---
common_voice_de_19650355 | Ursprünglich wollte er Journalist werden. | originally he wanted to be a journalist | originally he wanted to become a journalist | initially he wanted to be a journalist | originally he wanted to take a list | initially he wanted to be a journalist
common_voice_es_19723874 | Su abuela nació en Irlanda. | his grandmother was born in ireland | his grandmother was born in ireland | his grandfather was born in ireland | his wife is born in ireland | his wile was native to ireland
common_voice_it_20047586 | Lo staff iniziale superava le diecimila unità. | the initial staff exceeded the ten thousand units | the initial staff exceeded ten thousand units | the initial staff exceeded ten thousand units | the initial station was about ten million units | the initial state was exceeding ten thousand units