Announcing Video Translation & Speech Translation API Enhancements (2024)

Today, we are excited to share two major updates to the Azure AI Speech Translation product suite – Video Translation and an enhanced Realtime Speech Translation API.

Video Translation (Batch)

Today, we are announcing the preview availability of Video Translation, a groundbreaking service designed to transform the way businesses localize their video content. With the rising demand for accessible and engaging video content across global markets, Video Translation offers a seamless way to overcome language barriers. This launch includes an Azure Speech experience that customers can try out with their own video assets, with turnkey capabilities such as:

Dialogue extraction and translated subtitle generation
GPT reformulation with improved translation quality and automatic time alignment
Prebuilt neural voices, with content editing to manually adjust alignment and translation preferences
Personal voice capability (will be available under limited access)

The corresponding Video Translation API is also coming soon. Please fill in the form here to be considered for early API access.

Customer Scenarios for Video Translation

Video Translation unlocks business value across a wide range of scenarios involving authorized video content, such as:

TV shows, movies & documentaries: film studios and production companies can translate movies and TV shows for international distribution, reaching a broader audience and maximizing revenue potential.
Education & training videos: educational institutions and training programs can translate and dub learning materials to provide accurate and timely information to audiences worldwide.
Advertising & marketing video: businesses can localize their advertising and marketing videos to resonate with target audiences in different markets, enhancing brand awareness and customer engagement.

Language coverage for Video Translation

Video Translation supports the language pairs in the table below:

Source language	Target language
Hindi	English
Spanish	English
Chinese	English
Korean	English
English	Hindi
English	Spanish
English	Chinese
English	Italian
English	German
English	Russian

We also plan to quickly expand our language coverage in future releases.
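
For readers who want to check a localization request against the table above before uploading a video, the pairs can be captured in a small lookup. This is only an illustrative sketch: the names `SUPPORTED_PAIRS`, `is_supported_pair`, and `targets_for` are hypothetical and not part of any Azure SDK, and the data is a snapshot of the table in this announcement, which may change as coverage expands.

```python
from typing import List

# Language pairs copied from the table in this announcement (preview snapshot).
SUPPORTED_PAIRS = {
    ("Hindi", "English"),
    ("Spanish", "English"),
    ("Chinese", "English"),
    ("Korean", "English"),
    ("English", "Hindi"),
    ("English", "Spanish"),
    ("English", "Chinese"),
    ("English", "Italian"),
    ("English", "German"),
    ("English", "Russian"),
}


def is_supported_pair(source: str, target: str) -> bool:
    """Return True if (source, target) appears in the table above."""
    return (source.strip().title(), target.strip().title()) in SUPPORTED_PAIRS


def targets_for(source: str) -> List[str]:
    """List the target languages listed for a given source language."""
    name = source.strip().title()
    return sorted(t for s, t in SUPPORTED_PAIRS if s == name)


print(is_supported_pair("english", "German"))  # True
print(is_supported_pair("German", "English"))  # False: not in the table
print(targets_for("English"))
```

Note that the table is directional: a pair being listed in one direction does not imply support in the other.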

Multilingual Speech Translation (Realtime)

In addition to Video Translation, we are also excited to announce automatic multilingual speech translation as a major enhancement to our Realtime Speech Translation API. This launch introduces a range of features that enable translation capabilities that were not previously possible:

The biggest change is that the user no longer needs to set an input language. The API can now accept audio in a wide range of languages without being told beforehand which language is being spoken. This lets users translate audio in scenarios where the incoming language is unknown, such as a contact center serving a diverse global client base. We are very excited to introduce this feature, as it opens up a whole new world of possibilities for multilingual use cases.
In addition to receiving the translation, the user can also be told which language is being spoken in the input audio through Language Identification (LID) support. Although the model operates end-to-end and handles multiple languages, the user can still receive a list of the languages that were spoken during the session. This can be useful for documentation purposes or for scenarios with multiple speakers, such as a multilingual meeting.
The Speech Translation service can now also handle language switches within the same session. Input audio can arrive in multiple languages and all of it is translated into a single target language. There is no need to set an input language and no need to make a new API call when the language changes: the same session automatically handles the change and keeps producing output in the desired language. This is useful for speakers who often switch between several native languages, and for multilingual meetings in business or educational settings.
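
To make the session behavior described above more concrete, here is a minimal sketch of how a client application might keep track of per-utterance results within one session. It does not call the Speech Translation API. `Segment` and `SessionLog` are hypothetical types invented for illustration, and the sample results are made up. It assumes only what the text states: each result can carry an identified input language, and all output is in one target language.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One translated utterance (hypothetical shape, not an SDK type)."""
    detected_language: str  # language code identified for this utterance
    translation: str        # text already translated into the target language


@dataclass
class SessionLog:
    """Collects segments for one session with a single target language."""
    target_language: str
    segments: List[Segment] = field(default_factory=list)

    def add(self, detected_language: str, translation: str) -> None:
        self.segments.append(Segment(detected_language, translation))

    def languages_spoken(self) -> List[str]:
        """Distinct input languages, in order of first appearance."""
        seen: List[str] = []
        for seg in self.segments:
            if seg.detected_language not in seen:
                seen.append(seg.detected_language)
        return seen

    def language_switches(self) -> int:
        """Number of times consecutive segments differ in language."""
        langs = [seg.detected_language for seg in self.segments]
        return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


# Made-up results: the speaker moves from French to Spanish and back.
log = SessionLog(target_language="en")
log.add("fr", "Hello, how are you?")
log.add("es", "Good morning.")
log.add("fr", "Thank you very much.")
print(log.languages_spoken())   # ['fr', 'es']
print(log.language_switches())  # 2
```

The point of the sketch is that the session object is created once: language changes show up only as data on individual segments, not as new sessions or new calls.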

Customer Scenarios for Realtime Speech Translation

The following are some new customer scenarios that Multilingual Speech Translation enables and that were not previously possible:

Translation in your daily life: Imagine you are walking through a diverse city such as New York, and a foreigner comes up to you. They start asking you about something in their own language, but not only do you not know what they are saying, you don’t even know what language they are speaking. With our API integrated into a solution such as a translator mobile app, not only will you be told what language the person is speaking, you will also receive a full translation of it in text (and audio if you choose), allowing you to communicate freely with this person as if there were no language barrier at all.
Live translated captions for videos: Let’s say you are dabbling in French cuisine and would like to follow along with a video of a chef making Boeuf bourguignon, all in French (which you do not speak). Using a translator mobile app powered by Azure’s Speech Translation API, you can now play the video and get streaming translated captions, as if the chef were teaching you in real time!
Multilingual meeting: A use case that really showcases the power of the new multilingual model is a meeting that includes native speakers of many languages. Consider a diplomatic meeting in which delegates from many countries converse with each other. If each of them uses a solution built on our API, they can all speak freely and have a natural conversation without worrying about a language barrier. The API automatically handles language switches and still translates into the target language, allowing seamless conversation even when speakers of multiple languages are in the same room.

As you can see from the above scenarios, Multilingual Speech Translation opens the door to new possibilities that previously would’ve been tedious, incredibly inefficient, or downright impossible.

Language Support for Multilingual Speech Translation

At the time of Public Preview, Multilingual Speech Translation will be offered with 40 input languages. These are the languages the API will automatically detect and switch between in the input. The output (target) language can still be any of the languages supported by the Azure Speech Translation Service. The 40 input languages are as follows (along with their language codes):

Arabic (ar), Basque (eu), Bosnian (bs), Bulgarian (bg), Chinese Simplified (zh), Chinese Traditional (zhh), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), Galician (gl), German (de), Greek (el), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Latvian (lv), Lithuanian (lt), Macedonian (mk), Norwegian (nb), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Thai (th), Turkish (tr), Ukrainian (uk), Vietnamese (vi), and Welsh (cy).
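
If an application needs to know whether a given language falls within the preview's automatic detection set, the list above can be kept as a simple lookup table. The sketch below is illustrative only: `INPUT_LANGUAGES` and `is_detectable` are hypothetical names, and the codes are copied verbatim from this announcement rather than taken from an SDK or service response.

```python
# Input languages and codes copied verbatim from the list above.
INPUT_LANGUAGES = {
    "ar": "Arabic", "eu": "Basque", "bs": "Bosnian", "bg": "Bulgarian",
    "zh": "Chinese Simplified", "zhh": "Chinese Traditional",
    "cs": "Czech", "da": "Danish", "nl": "Dutch", "en": "English",
    "et": "Estonian", "fi": "Finnish", "fr": "French", "gl": "Galician",
    "de": "German", "el": "Greek", "hi": "Hindi", "hu": "Hungarian",
    "id": "Indonesian", "it": "Italian", "ja": "Japanese", "ko": "Korean",
    "lv": "Latvian", "lt": "Lithuanian", "mk": "Macedonian",
    "nb": "Norwegian", "pl": "Polish", "pt": "Portuguese",
    "ro": "Romanian", "ru": "Russian", "sr": "Serbian", "sk": "Slovak",
    "sl": "Slovenian", "es": "Spanish", "sv": "Swedish", "th": "Thai",
    "tr": "Turkish", "uk": "Ukrainian", "vi": "Vietnamese", "cy": "Welsh",
}


def is_detectable(code: str) -> bool:
    """Return True if the code is in the preview input-language list."""
    return code.strip().lower() in INPUT_LANGUAGES


print(len(INPUT_LANGUAGES))  # 40, matching the count stated above
print(is_detectable("cy"))   # True (Welsh)
print(is_detectable("he"))   # False: not in the list above
```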

In the upcoming version, we plan to support all input languages that are supported by Speech Translation. Language and locale support will be continuously updated and expanded to make our model more accessible to all.

Getting Started

Get started with Video Translation by uploading your own video today.

Get started implementing Multilingual Speech Translation in your products by using our Quickstart Guide.