Speech-to-speech translation (S2ST) has been a transformative technology in breaking down language barriers, but the scarcity of parallel speech data has hindered its progress. Most existing models require supervised settings and struggle with learning translation and speech attribute reconstruction from synthesized training data.

Previous models from Google AI, such as Translatotron 1 and Translatotron 2, made notable advances by translating speech directly between languages. However, these models relied on supervised training with parallel speech data, and such data is scarce, which makes training S2ST models difficult. Translatotron 3, introduced by a Google research team, is designed to address this limitation.

The researchers recognized that most public datasets for speech translation are semi- or fully synthesized from text, which creates additional hurdles in learning translation and in accurately reconstructing speech attributes that may not be well represented in the text. In response, Translatotron 3 introduces unsupervised S2ST, which aims to learn the translation task solely from monolingual data. This approach expands the potential for translation across various language pairs and makes it possible to carry over non-textual speech attributes such as pauses, speaking rates, and speaker identity.


Translatotron 3’s architecture is designed with three key aspects to address the challenges of unsupervised S2ST:

  1. Pre-training as a Masked Autoencoder with SpecAugment: The entire model is pre-trained as a masked autoencoder using SpecAugment, a simple data augmentation method for speech recognition. SpecAugment operates on the log-mel spectrogram of the input audio and improves the encoder’s generalization capabilities.
  2. Unsupervised Embedding Mapping based on Multilingual Unsupervised Embeddings (MUSE): Translatotron 3 leverages MUSE, a technique trained on unpaired languages that enables the model to learn a shared embedding space between the source and target languages. This shared embedding space facilitates more efficient and effective encoding of input speech.
  3. Reconstruction Loss through Back-Translation: The model is trained using a combination of unsupervised MUSE embedding loss, reconstruction loss, and S2S back-translation loss. During inference, a shared encoder encodes the input into a multilingual embedding space, subsequently decoded by the target language decoder.
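
To make the first item more concrete, here is a minimal sketch of SpecAugment-style masking on a log-mel spectrogram. It is not the Translatotron 3 implementation. It assumes the spectrogram is a plain list of frames (each a list of mel-bin values), applies only one time mask and one frequency mask, and omits the time-warping step of the original SpecAugment method. The function name `spec_augment` and its parameters are hypothetical, chosen for illustration.

```python
import random


def spec_augment(spec, max_time_width=2, max_freq_width=2,
                 mask_value=0.0, rng=None):
    """Return a masked copy of `spec` (hypothetical helper, illustration only).

    `spec` is a list of frames; each frame is a list of log-mel bin values.
    One run of consecutive frames and one run of consecutive mel bins are
    overwritten with `mask_value`. The input is left unchanged.
    """
    if rng is None:
        rng = random.Random()
    num_frames = len(spec)
    num_bins = len(spec[0])
    out = [list(frame) for frame in spec]  # copy so the input stays intact

    # Time mask: blank out a random run of consecutive frames.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [mask_value] * num_bins

    # Frequency mask: blank out a random run of consecutive mel bins
    # in every frame.
    f_width = rng.randint(0, min(max_freq_width, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for frame in out:
        for f in range(f_start, f_start + f_width):
            frame[f] = mask_value

    return out


# Toy input: 10 frames, 8 mel bins, every value set to 1.0.
spec = [[1.0] * 8 for _ in range(10)]
masked = spec_augment(spec, rng=random.Random(0))
```

In a masked-autoencoder setup, the model would receive `masked` as input and be trained to reconstruct the unmasked `spec`, which pushes the encoder to rely on context rather than on any single frame or frequency band.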

Translatotron 3’s training methodology has two parts: auto-encoding with reconstruction, and a back-translation term. In the first part, the network is trained to auto-encode the input into a multilingual embedding space using the MUSE loss and the reconstruction loss. This phase aims to ensure that the network generates meaningful multilingual representations. In the second part, the network is further trained to translate the input spectrogram using the back-translation loss. To keep the latent space multilingual, the MUSE loss and the reconstruction loss are also applied in this second part. SpecAugment is applied to the encoder input in both phases so that meaningful properties are learned.
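
The two-part schedule can be summarized as a rule for which loss terms are active in each phase. The sketch below is only an illustration of that rule. The function name `training_loss`, the `weights` argument, and the default weight of 1.0 per term are assumptions; the article does not say how the terms are weighted.

```python
def training_loss(phase, muse_loss, reconstruction_loss,
                  back_translation_loss=0.0, weights=None):
    """Combine per-batch loss values according to the training phase.

    Hypothetical helper: phase 1 uses the MUSE and reconstruction terms;
    phase 2 keeps both and adds the back-translation term. The weights
    are assumed, not taken from the source.
    """
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")

    w = {"muse": 1.0, "reconstruction": 1.0, "back_translation": 1.0}
    if weights:
        w.update(weights)

    # Active in both phases: keeps the latent space multilingual and
    # the decoder able to reconstruct its input.
    total = w["muse"] * muse_loss + w["reconstruction"] * reconstruction_loss

    # Active only in the second phase: the translation signal.
    if phase == 2:
        total += w["back_translation"] * back_translation_loss

    return total


phase1 = training_loss(1, muse_loss=0.5, reconstruction_loss=0.25,
                       back_translation_loss=4.0)
phase2 = training_loss(2, muse_loss=0.5, reconstruction_loss=0.25,
                       back_translation_loss=4.0)
```

With these toy values, the back-translation term is ignored in phase 1 and added in phase 2, while the other two terms contribute in both.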

The empirical evaluation of Translatotron 3 shows that it outperforms a baseline cascade system in translation quality, speaker similarity, and speech quality, and that it is particularly strong at preserving conversational nuances. Despite being an unsupervised method, Translatotron 3 is a robust solution that compares well with existing systems. Its speech naturalness, as measured by the Mean Opinion Score (MOS), is comparable to that of ground-truth audio samples, which underlines its effectiveness in real-world scenarios.


Translatotron 3 addresses the scarcity of parallel speech data by making S2ST unsupervised. By learning from monolingual data and leveraging MUSE, the model achieves better translation quality than the baseline and preserves essential non-textual speech attributes. The research team’s approach is a significant step towards making speech-to-speech translation more versatile and effective across various language pairs, and its results point to the potential to improve communication between diverse linguistic communities. In future work, the team aims to extend the model to more languages and explore its applicability in zero-shot S2ST scenarios, potentially broadening its impact on global communication.


Madhur Garg is a consulting intern at MarktechPost. He is currently pursuing his B.Tech in Civil and Environmental Engineering from the Indian Institute of Technology (IIT), Patna. He shares a strong passion for Machine Learning and enjoys exploring the latest advancements in technologies and their practical applications. With a keen interest in artificial intelligence and its diverse applications, Madhur is determined to contribute to the field of Data Science and leverage its potential impact in various industries.
