Abstract

Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, training these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, large amounts of data and many pretrained models have accumulated for S2TT and TTS, yet they have not been fully utilized in the development of S2ST models. Motivated by this, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method, ComSpeech-ZS, that uses only S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when parallel speech data is available, ComSpeech surpasses previous two-pass models such as UnitY and Translatotron 2 in both translation quality and decoding speed. When no parallel speech data is available, ComSpeech-ZS lags behind ComSpeech by only 0.7 ASR-BLEU and outperforms the cascaded models.

Introduction

ComSpeech is a novel two-pass S2ST model architecture that can seamlessly integrate any S2TT and TTS models into a direct S2ST model. The key component is a CTC-based vocabulary adaptor, which addresses the vocabulary mismatch between the S2TT and TTS models. This allows us to fully leverage existing pretrained S2TT and TTS models, thereby benefiting from the latest advancements in the S2TT and TTS research communities.
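This page does not spell out the internals of the vocabulary adaptor. As a minimal sketch of the generic CTC collapse rule that CTC-based components rely on (not the authors' implementation), the example below merges repeated frame-level labels and then drops blanks. The blank symbol, the phoneme labels, and the function name `ctc_collapse` are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical blank symbol, not taken from the paper


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the standard CTC collapse: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical frame-level argmax predictions over a phoneme-style TTS vocabulary.
frames = ["<blank>", "HH", "HH", "<blank>", "AY", "AY", "AY", "<blank>", "<blank>"]
phonemes = ctc_collapse(frames)  # shorter sequence in the target vocabulary
```

Because repeats are merged before blanks are removed, a blank between two identical labels keeps them as two separate output tokens, which is how CTC represents genuinely doubled symbols.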

ComSpeech-ZS is a novel training method for ComSpeech that relies solely on S2TT and TTS data, without any parallel speech data. The key idea is to align representations through contrastive learning within the representation space of the TTS encoder, so that the model's TTS capability generalizes to S2ST in a zero-shot manner. This significantly reduces the difficulty of collecting S2ST training data and allows for the full utilization of existing S2TT and TTS data.
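The exact contrastive objective is not given on this page. As a rough illustration of how contrastive alignment works in general, the sketch below uses an InfoNCE-style loss, a common choice that is assumed here rather than confirmed by the source. The two toy vector sets stand in for paired representations of the same sentence (one derived from the speech side, one from the text side); the vectors, the temperature value, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    every other entry of positives acts as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_denom - logits[i]  # negative log-softmax of the true pair
    return total / len(anchors)


# Toy 2-d representations (illustrative values only).
speech_side = [[1.0, 0.0], [0.0, 1.0]]
text_side_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_side_swapped = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce(speech_side, text_side_aligned)
swapped_loss = info_nce(speech_side, text_side_swapped)
```

The loss is lower when each speech-side vector is closest to its own text-side partner, so minimizing it pulls paired representations together and pushes unpaired ones apart.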

Experiments

In the supervised learning scenario, we find that:

  1. ComSpeech outperforms Translatotron 2 and UnitY in translation quality across all three language pairs (A1-A2 vs. A5-A6).
  2. Benefiting from the parallel decoding capability of FastSpeech 2, ComSpeech achieves a 3.40× decoding speedup compared with Translatotron 2.
  3. Compared to DASpeech, which also employs FastSpeech 2 as the TTS module, ComSpeech shows a 3.1 ASR-BLEU improvement in translation quality.

In the zero-shot learning scenario, we find that:

  1. Despite not using any S2ST data, the translation quality of ComSpeech-ZS is comparable to that of ComSpeech in the supervised learning scenario, with only a 0.7 ASR-BLEU difference (A6 vs. B2).
  2. ComSpeech-ZS also surpasses the performance of Translatotron 2, UnitY, S2UT, and DASpeech (A1-A4 vs. B2).
  3. The translation quality of ComSpeech-ZS surpasses that of the cascaded system, possibly due to avoiding error accumulation.
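ASR-BLEU scores a system by transcribing its generated speech with an ASR model and computing BLEU between that transcript and the reference text. As a simplified illustration of the n-gram matching at the core of BLEU, the sketch below computes clipped n-gram precision for one transcript and reference pair taken from the Es-En samples on this page. It is not a full BLEU implementation: it omits the brevity penalty, the geometric mean over n-gram orders, and corpus-level aggregation, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference,
    with each n-gram's count clipped to its count in the reference."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "the atom was chosen for its light and efficient design"
asr_hypothesis = "the act was chosen for its light and efficient design"

unigram = clipped_precision(asr_hypothesis, reference, 1)
bigram = clipped_precision(asr_hypothesis, reference, 2)
```

A single wrong word ("act" for "atom") costs one unigram but two bigrams, which is why higher-order n-grams make the metric sensitive to local errors.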

Audio Samples

Fr-En

Sample id: common_voice_fr_17330818
Source text: La parole est à Madame la ministre déléguée.
Target text (ground truth): our deputy minister has the floor to speak
ASR transcripts of the predicted audio:
  ComSpeech: our honorable deputy minister has the floor to speak
  UnitY: the floor is open to missus the delegated minister
  Translatotron 2: i shall give the floor to madame deputy minister
  ComSpeech-ZS: our honorable deputy minister has the floor to speak
  Cascade: i shall give the floor to missus the delegated minister

Sample id: common_voice_fr_17305981
Source text: "La civilisation Maya est très ancienne et mérite qu'on y porte plus d'attention."
Target text (ground truth): the mayan civilization is very old and deserves more attention
ASR transcripts of the predicted audio:
  ComSpeech: men civilization is very old and deserves more attention to it
  UnitY: the ancient mayan civilization is more careful
  Translatotron 2: the main civilization is an ancient in america carries more attention to it
  ComSpeech-ZS: mayan civilization is very old and deserves more attention to it
  Cascade: the maya civilization very old in merits have more attention

Sample id: common_voice_fr_17976517
Source text: La parole est à Monsieur Dominique Tian, pour soutenir l’amendement numéro cent cinq.
Target text (ground truth): mr dominique tian may now speak in support of amendment number five hundred
ASR transcripts of the predicted audio:
  ComSpeech: dominique t n may now speak in support of amendment number one hundred and five
  UnitY: the floor is now open to mister dominique tien to support the one hundred fifth amendment
  Translatotron 2: the floor is to mister dominique tan to defend the amendment number one hundred and five
  ComSpeech-ZS: dominicue dan may now speak in support of amendment number one hundred and five
  Cascade: the floor is now open to mister dam and ichtion to support the one hundred fifth amendment

De-En

Sample id: common_voice_de_19769879
Source text: Deutschland folgte mit acht, dahinter Russland mit sechs Goldmedaillen.
Target text (ground truth): germany followed with eight right before russia with six gold medals
ASR transcripts of the predicted audio:
  ComSpeech: germany followed with eight days of russia with six gold medals
  UnitY: germany was followed by eight because he joined by great medals
  Translatotron 2: pursued at the rosland as well as six gold medals
  ComSpeech-ZS: germany followed with eight days behind russia with six gold medal
  Cascade: on average there followed six gold medals with the eight days

Sample id: common_voice_de_17881788
Source text: Der Hefeteig wird ein paar Stunden brauchen.
Target text (ground truth): the yeast dough will take a few hours
ASR transcripts of the predicted audio:
  ComSpeech: the madman will take a few hours
  UnitY: the hostile mother is awaiting something
  Translatotron 2: the ponetic transcription will occur
  ComSpeech-ZS: the madman will take a few hours
  Cascade: the safe was already unlocked when i came in

Sample id: common_voice_de_19268197
Source text: Foster war zunächst als Modell in der Modebranche tätig.
Target text (ground truth): foster initially worked as a model in fashion industry
ASR transcripts of the predicted audio:
  ComSpeech: forster initially worked as a model in fashion design
  UnitY: forster was initially working as a model in the fashion store
  Translatotron 2: forster was initially a model in the fashion industry
  ComSpeech-ZS: forster initially worked as a model in fashion design
  Cascade: forster was initially active as a model in the fashion design

Es-En

Sample id: common_voice_es_19678239
Source text: El Atom fue elegido por su diseño ligero y eficiente.
Target text (ground truth): the atom was chosen for its light and efficient design
ASR transcripts of the predicted audio:
  ComSpeech: the act was chosen for its light and efficient design
  UnitY: the atam was chosen by his light designed and efficient
  Translatotron 2: the act was chosen by its like design and official
  ComSpeech-ZS: the act was chosen for its light and efficient design
  Cascade: the act was chosen by its light design and efficient

Sample id: common_voice_es_19326109
Source text: Se encuentra en pastizales, márgenes de campos y caminos.
Target text (ground truth): it can be found in pastures field margins and roads
ASR transcripts of the predicted audio:
  ComSpeech: it can be found in pastures fields and roads
  UnitY: they are found in pastures fields margins and rots
  Translatotron 2: it is located in pastures seas of fields and roads
  ComSpeech-ZS: it can be found in pastures fields and roads
  Cascade: they are found in pastures fields images and paths

Sample id: common_voice_es_19958814
Source text: Primer profesor peruano de áreas protegidas y manejo de fauna.
Target text (ground truth): first peruvian professor of protected areas and fauna management
ASR transcripts of the predicted audio:
  ComSpeech: first peruvian professor of protected areas and booming
  UnitY: it has a peruvian professor of protected areas in the state of hang
  Translatotron 2: he is a peruvian professor of artists and masterpiece
  ComSpeech-ZS: first peruvian professor of protected areas in this area
  Cascade: he serves as a peruvian professor and maines delphin

BibTeX

@inproceedings{fang-etal-2024-can,
    title = {Can We Achieve High-quality Direct Speech-to-Speech Translation Without Parallel Speech Data?},
    author = {Fang, Qingkai and Zhang, Shaolei and Ma, Zhengrui and Zhang, Min and Feng, Yang},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
    year = {2024},
}