
DeepL introduces DeepL Voice, real-time text-based translation for voice and video

DeepL has made a name for itself with online text translations that it claims are more nuanced and accurate than services from companies like Google – a pitch that has catapulted the German startup to a $2 billion valuation and more than 100,000 paying customers.

As the hype around AI services continues to grow, DeepL is adding another mode to the platform: audio. With DeepL Voice, users can now listen to someone speaking in one language and have it automatically translated into another in real time.

English, German, Japanese, Korean, Swedish, Dutch, French, Turkish, Polish, Portuguese, Russian, Spanish and Italian are the languages DeepL can "hear" today. Translated subtitles are available in all 33 languages currently supported by DeepL Translator.

Image credits: DeepL

For now, DeepL Voice does not deliver its results as audio or video files: the service is aimed at real-time live conversations and video conferences, and its output comes as text, not audio.

For live conversations, you can set your translations to appear as a "mirror" on a smartphone – the idea is that you place the phone between you on a meeting table so both sides can see the translated words – or as a transcription that you share side by side with someone. In video conferencing services, the translations appear as subtitles.

That could change over time, suggested Jarek Kutylowski, the company's founder and CEO (pictured above), in an interview. This is DeepL's first voice product, but it probably won't be the last. "[Voice] is where translation is going to play out over the next year," he added.

There is further evidence to support this statement. Google – one of DeepL's biggest competitors – has also started integrating real-time translated captions into its video conferencing service Meet. And a variety of AI startups are developing voice translation services, such as AI speech specialist Eleven Labs (Eleven Labs Dubbing) and Panjaya, which creates translations using "deepfaked" voices and video that match the audio.

The latter uses Eleven Labs' API, and according to Kutylowski, Eleven Labs itself uses technology from DeepL to power its translation service.

Audio output isn't the only feature that has yet to arrive.

There is also currently no API for the voice product. DeepL's primary business focuses on B2B, and Kutylowski said the company works directly with partners and customers.

Nor is there a wide range of integrations: the only video calling service that currently supports DeepL's captions is Teams, which "covers most of our customers," Kutylowski said. There is no word on if or when Zoom or Google Meet will integrate DeepL Voice.

The product has been a long time coming for DeepL users, and not just because we've been inundated with a host of other AI language services for translation. Kutylowski said this has been the most common customer request since 2017, the year DeepL was launched.

One reason for the wait is that DeepL has taken a fairly deliberate approach to building its product. Unlike many others in the world of AI applications that rely on and fine-tune other companies' large language models (LLMs), DeepL's aim is to build its service from the ground up. In July, the company released a new LLM optimized for translation that it says outperforms GPT-4 as well as offerings from Google and Microsoft, not least because its primary purpose is translation. The company has also continued to improve the quality of its written output and its glossaries.

Likewise, one of DeepL Voice's unique selling points is that it works in real time. That matters because many "AI translation" services on the market actually operate with a delay, making them difficult or impossible to use in live situations – the very use case DeepL is targeting.

Kutylowski suggested that this is another reason the new voice product focuses on text-based translations: text can be computed and delivered very quickly, while the processing and AI architecture still have some way to go before audio and video translations can be produced equally fast.

Video conferencing and meetings are likely use cases for DeepL Voice, but Kutylowski noted that the company envisions another important use case in the service industry, where frontline workers in restaurants, for example, could use the service to make communicating with customers easier.

This could be useful, but it also highlights one of the trickier aspects of the service. In a world where we are all suddenly much more aware of privacy, and of how new services and platforms co-opt private or proprietary information, it remains to be seen how willing people will be to have their voices used this way.

Kutylowski insisted that while voices are transmitted to DeepL's servers for translation (the processing does not occur on the device), nothing is stored by its systems or used to train its LLMs. Further down the line, DeepL will work with its customers to ensure they are not violating GDPR or other data protection regulations.