StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

1Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2Key Laboratory of AI Safety, Chinese Academy of Sciences
3University of Chinese Academy of Sciences, Beijing, China
4School of Future Science and Engineering, Soochow University
ACL 2024 Main Conference

Abstract

Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a. streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate the corresponding target speech at the opportune moment within the speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. StreamSpeech is an “All in One” seamless streaming model for speech recognition, speech translation and speech synthesis, which can effectively identify the opportune moment to start translating within the streaming speech inputs. Experiments on the CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance on both offline S2ST and Simul-S2ST tasks. Besides, StreamSpeech can present high-quality intermediate results (i.e., ASR or translation results) during the simultaneous translation process, offering a more comprehensive real-time communication experience.

StreamSpeech can simultaneously provide ASR, translation, and synthesis results

Introducing StreamSpeech

StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
StreamSpeech employs a two-pass architecture that first converts source speech into target text hidden states (autoregressive speech-to-text translation, AR-S2TT) and then generates target speech via non-autoregressive text-to-unit generation. Source, target and unit CTC decoders are introduced to learn alignments via the auxiliary tasks of speech recognition (ASR), non-autoregressive speech-to-text translation (NAR-S2TT) and speech-to-unit translation (S2UT), accordingly guiding StreamSpeech on when to start recognizing, translating and synthesizing.
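To make the policy idea above concrete, here is a minimal sketch (not the authors' implementation) of a CTC-guided READ/WRITE loop: a CTC decoder over the streaming input counts how many source tokens have been heard so far, and the model WRITEs new target tokens whenever enough new source tokens have accumulated. The names `count_ctc_tokens`, `translate_prefix` and the one-token-per-chunk behavior are hypothetical stand-ins for the real model components.

```python
# Hedged sketch of a CTC-alignment-based simultaneous policy.
# Assumption: each 320 ms speech chunk yields one source CTC token,
# and the translator emits one target token per source token heard.

def count_ctc_tokens(chunks):
    """Stub CTC decoder: pretend each chunk contributes one source token."""
    return len(chunks)

def translate_prefix(n_src_tokens):
    """Stub translator: produce a target hypothesis for the heard prefix."""
    return [f"tgt_{i}" for i in range(n_src_tokens)]

def simul_policy(stream, tokens_per_write=2):
    """Alternate READ (consume a chunk) and WRITE (emit new target tokens)."""
    heard, written = [], []
    for chunk in stream:                                 # READ a new chunk
        heard.append(chunk)
        n_src = count_ctc_tokens(heard)
        if n_src - len(written) >= tokens_per_write:     # enough new tokens?
            hypo = translate_prefix(n_src)               # WRITE: extend output
            written.extend(hypo[len(written):])
    # Flush any remaining target tokens once the input stream ends.
    final = translate_prefix(count_ctc_tokens(heard))
    written.extend(final[len(written):])
    return written

print(simul_policy(range(5)))  # incremental output over 5 chunks
```

In StreamSpeech the READ/WRITE decision is driven by the learned CTC alignments rather than a fixed token count, but the control flow (consume a chunk, check alignment progress, emit a prefix) follows this shape.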

🎈Highlight:
1. StreamSpeech achieves state-of-the-art performance on both offline and simultaneous speech-to-speech translation.
2. StreamSpeech can perform streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation via an "All in One" seamless model.
3. StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.

🎧 Listen to StreamSpeech 🎧

French-to-English: common_voice_fr_17301936.mp3

Speech Inputs
Ground Truth Transcription: jai donc lexpérience des années passées jen dirai un mot tout à lheure
Translation: i therefore have the experience of the passed years i'll say a few words about that later
Target Speech:
| Setting      | Model                       | ASR output                                                             | S2TT output                                                        |
|--------------|-----------------------------|------------------------------------------------------------------------|--------------------------------------------------------------------|
| Offline      | UnitY                       | N/A                                                                     | so i have the experience of years passed i'll tell a word later    |
| Offline      | StreamSpeech                | jai donc lexpérience des années passé jen dirairai un mot tout à lheure | so i have the experience in the past years i'll say a word later   |
| Simultaneous | Wait-k (k=3)                | N/A                                                                     | the therefore i have the experience of past years i will tell you a word later |
| Simultaneous | StreamSpeech (chunk=320ms)  | jai donc expérience des années passé jen dirairai un mot tout à lheure  | i therefore have an experience of last years i will tell a word later |

The S2ST (output) and S2ST (multi-channel) columns are audio samples on the project page (left channel 👂🏻: inputs, right channel 👂🏻: outputs) and cannot be reproduced in text.

🎗StreamSpeech Performance🎗

1. Offline speech-to-speech translation on CVSS-C benchmark




2. Simultaneous speech-to-speech translation on CVSS-C benchmark




3. Performance on Streaming ASR




4. Performance on Simultaneous Speech-to-Text Translation


BibTeX

If you have any questions, please contact Shaolei Zhang (zhangshaolei20z@ict.ac.cn).

@inproceedings{streamspeech,
        title={StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning}, 
        author={Shaolei Zhang and Qingkai Fang and Shoutao Guo and Zhengrui Ma and Min Zhang and Yang Feng},
        year={2024},
        booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Long Papers)},
        publisher = {Association for Computational Linguistics}
  }