Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, training these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Motivated by this, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method, ComSpeech-ZS, that uses only S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when parallel speech data is available, ComSpeech surpasses previous two-pass models such as UnitY and Translatotron 2 in both translation quality and decoding speed. When no parallel speech data is available, ComSpeech-ZS lags behind ComSpeech by only 0.7 ASR-BLEU and outperforms the cascaded models.
ComSpeech is a novel two-pass S2ST model architecture that can seamlessly integrate any S2TT and TTS models into a direct S2ST model. The key component is a CTC-based vocabulary adaptor, which addresses the vocabulary mismatch between the S2TT and TTS models. This allows us to fully leverage existing pretrained S2TT and TTS models, thereby benefiting from the latest advancements in the S2TT and TTS research communities.
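To make the CTC idea behind the vocabulary adaptor more concrete, here is a minimal sketch of generic greedy CTC decoding: pick the best token per frame, merge consecutive repeats, and drop blanks. It is not the ComSpeech implementation. The function name `ctc_greedy_collapse`, the blank index `BLANK = 0`, and the toy scores are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # hypothetical index of the CTC blank symbol


def ctc_greedy_collapse(frame_scores):
    """Greedy CTC decoding over per-frame score vectors.

    frame_scores: list of lists, one score per vocabulary entry per frame.
    Returns the token ids after merging consecutive repeats and removing blanks.
    """
    # Best-scoring vocabulary index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # groupby merges runs of identical ids; blanks are then discarded.
    return [token for token, _ in groupby(best) if token != BLANK]


# Toy example: 6 frames over a 3-entry vocabulary (0 = blank).
toy_scores = [
    [0.1, 0.8, 0.1],    # token 1
    [0.2, 0.7, 0.1],    # token 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two 1s)
    [0.1, 0.8, 0.1],    # token 1
    [0.1, 0.1, 0.8],    # token 2
    [0.1, 0.1, 0.8],    # token 2 (repeat, merged)
]
decoded = ctc_greedy_collapse(toy_scores)
```

Because the blank separates the two runs of token 1, both survive the collapse. This is how CTC can emit a shorter output sequence in a different vocabulary than the one the frames were originally computed for.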
ComSpeech-ZS is a novel training method for ComSpeech that relies solely on S2TT and TTS data, without any parallel speech data. The key idea is to align representations through contrastive learning within the representation space of the TTS encoder, so that the model's TTS capabilities generalize to S2ST in a zero-shot manner. This significantly reduces the difficulty of collecting S2ST training data and allows existing S2TT and TTS data to be fully utilized.
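The contrastive alignment idea can be illustrated with a generic InfoNCE-style loss in plain Python. This is a sketch under stated assumptions, not the exact objective used by ComSpeech-ZS. The names `cosine` and `info_nce`, the temperature value, and the toy vectors are hypothetical. Each anchor (for example, a representation from the S2TT side) is pulled toward the same-index positive (for example, a representation from the TTS encoder) and pushed away from the other entries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] is paired with positives[i].

    All other positives act as negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy check: correctly paired representations give a much lower loss
# than the same representations paired in the wrong order.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind draws paired representations together in a shared space. That shared space is the property the zero-shot transfer described above relies on.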
In the supervised learning scenario, we find that:
In the zero-shot learning scenario, we find that:
Sample id | Utterance | Text (ground truth) or ASR transcription (prediction)
---|---|---
common_voice_fr_17330818 | Source audio (ground truth) | La parole est à Madame la ministre déléguée.
common_voice_fr_17330818 | Target audio (ground truth) | our deputy minister has the floor to speak
common_voice_fr_17330818 | ComSpeech | our honorable deputy minister has the floor to speak
common_voice_fr_17330818 | UnitY | the floor is open to missus the delegated minister
common_voice_fr_17330818 | Translatotron 2 | i shall give the floor to madame deputy minister
common_voice_fr_17330818 | ComSpeech-ZS | our honorable deputy minister has the floor to speak
common_voice_fr_17330818 | Cascade | i shall give the floor to missus the delegated minister
common_voice_fr_17305981 | Source audio (ground truth) | "La civilisation Maya est très ancienne et mérite qu'on y porte plus d'attention."
common_voice_fr_17305981 | Target audio (ground truth) | the mayan civilization is very old and deserves more attention
common_voice_fr_17305981 | ComSpeech | men civilization is very old and deserves more attention to it
common_voice_fr_17305981 | UnitY | the ancient mayan civilization is more careful
common_voice_fr_17305981 | Translatotron 2 | the main civilization is an ancient in america carries more attention to it
common_voice_fr_17305981 | ComSpeech-ZS | mayan civilization is very old and deserves more attention to it
common_voice_fr_17305981 | Cascade | the maya civilization very old in merits have more attention
common_voice_fr_17976517 | Source audio (ground truth) | La parole est à Monsieur Dominique Tian, pour soutenir l’amendement numéro cent cinq.
common_voice_fr_17976517 | Target audio (ground truth) | mr dominique tian may now speak in support of amendment number five hundred
common_voice_fr_17976517 | ComSpeech | dominique t n may now speak in support of amendment number one hundred and five
common_voice_fr_17976517 | UnitY | the floor is now open to mister dominique tien to support the one hundred fifth amendment
common_voice_fr_17976517 | Translatotron 2 | the floor is to mister dominique tan to defend the amendment number one hundred and five
common_voice_fr_17976517 | ComSpeech-ZS | dominicue dan may now speak in support of amendment number one hundred and five
common_voice_fr_17976517 | Cascade | the floor is now open to mister dam and ichtion to support the one hundred fifth amendment

Sample id | Utterance | Text (ground truth) or ASR transcription (prediction)
---|---|---
common_voice_de_19769879 | Source audio (ground truth) | Deutschland folgte mit acht, dahinter Russland mit sechs Goldmedaillen.
common_voice_de_19769879 | Target audio (ground truth) | germany followed with eight right before russia with six gold medals
common_voice_de_19769879 | ComSpeech | germany followed with eight days of russia with six gold medals
common_voice_de_19769879 | UnitY | germany was followed by eight because he joined by great medals
common_voice_de_19769879 | Translatotron 2 | pursued at the rosland as well as six gold medals
common_voice_de_19769879 | ComSpeech-ZS | germany followed with eight days behind russia with six gold medal
common_voice_de_19769879 | Cascade | on average there followed six gold medals with the eight days
common_voice_de_17881788 | Source audio (ground truth) | Der Hefeteig wird ein paar Stunden brauchen.
common_voice_de_17881788 | Target audio (ground truth) | the yeast dough will take a few hours
common_voice_de_17881788 | ComSpeech | the madman will take a few hours
common_voice_de_17881788 | UnitY | the hostile mother is awaiting something
common_voice_de_17881788 | Translatotron 2 | the ponetic transcription will occur
common_voice_de_17881788 | ComSpeech-ZS | the madman will take a few hours
common_voice_de_17881788 | Cascade | the safe was already unlocked when i came in
common_voice_de_19268197 | Source audio (ground truth) | Foster war zunächst als Modell in der Modebranche tätig.
common_voice_de_19268197 | Target audio (ground truth) | foster initially worked as a model in fashion industry
common_voice_de_19268197 | ComSpeech | forster initially worked as a model in fashion design
common_voice_de_19268197 | UnitY | forster was initially working as a model in the fashion store
common_voice_de_19268197 | Translatotron 2 | forster was initially a model in the fashion industry
common_voice_de_19268197 | ComSpeech-ZS | forster initially worked as a model in fashion design
common_voice_de_19268197 | Cascade | forster was initially active as a model in the fashion design

Sample id | Utterance | Text (ground truth) or ASR transcription (prediction)
---|---|---
common_voice_es_19678239 | Source audio (ground truth) | El Atom fue elegido por su diseño ligero y eficiente.
common_voice_es_19678239 | Target audio (ground truth) | the atom was chosen for its light and efficient design
common_voice_es_19678239 | ComSpeech | the act was chosen for its light and efficient design
common_voice_es_19678239 | UnitY | the atam was chosen by his light designed and efficient
common_voice_es_19678239 | Translatotron 2 | the act was chosen by its like design and official
common_voice_es_19678239 | ComSpeech-ZS | the act was chosen for its light and efficient design
common_voice_es_19678239 | Cascade | the act was chosen by its light design and efficient
common_voice_es_19326109 | Source audio (ground truth) | Se encuentra en pastizales, márgenes de campos y caminos.
common_voice_es_19326109 | Target audio (ground truth) | it can be found in pastures field margins and roads
common_voice_es_19326109 | ComSpeech | it can be found in pastures fields and roads
common_voice_es_19326109 | UnitY | they are found in pastures fields margins and rots
common_voice_es_19326109 | Translatotron 2 | it is located in pastures seas of fields and roads
common_voice_es_19326109 | ComSpeech-ZS | it can be found in pastures fields and roads
common_voice_es_19326109 | Cascade | they are found in pastures fields images and paths
common_voice_es_19958814 | Source audio (ground truth) | Primer profesor peruano de áreas protegidas y manejo de fauna.
common_voice_es_19958814 | Target audio (ground truth) | first peruvian professor of protected areas and fauna management
common_voice_es_19958814 | ComSpeech | first peruvian professor of protected areas and booming
common_voice_es_19958814 | UnitY | it has a peruvian professor of protected areas in the state of hang
common_voice_es_19958814 | Translatotron 2 | he is a peruvian professor of artists and masterpiece
common_voice_es_19958814 | ComSpeech-ZS | first peruvian professor of protected areas in this area
common_voice_es_19958814 | Cascade | he serves as a peruvian professor and maines delphin

@inproceedings{fang-etal-2024-can,
title = {Can We Achieve High-quality Direct Speech-to-Speech Translation Without Parallel Speech Data?},
author = {Fang, Qingkai and Zhang, Shaolei and Ma, Zhengrui and Zhang, Min and Feng, Yang},
booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
year = {2024},
}