Meet Microsoft VALL-E: Text-to-speech sibling of DALL-E


Be careful: there is a very realistic-sounding AI out there!

By Emre Çıtak
10 January 2023
in AI

Microsoft has announced VALL-E, its take on text-to-speech synthesis, in a paper published by the company. The audio model needs only a three-second recording of a voice to reproduce it for any given text input.

Microsoft VALL-E is a novel language-model approach to text-to-speech synthesis (TTS) that uses audio codec codes as intermediate representations. It was pre-trained on 60,000 hours of English speech data and shows in-context learning abilities in zero-shot scenarios.

Microsoft VALL-E is a language model approach for text-to-speech synthesis

Microsoft VALL-E can produce high-quality personalized speech from just a three-second enrolled recording of an unseen speaker, which serves as an acoustic prompt. It does so without additional structural engineering, pre-designed acoustic features, or fine-tuning, and it supports in-context learning and prompt-based zero-shot TTS. The work also suggests that scaling up semi-supervised data for TTS has been underexploited: Microsoft used a large amount of semi-supervised data to build a TTS system that generalizes across speakers.
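The phrase "audio codec codes as intermediate representations" means the model predicts discrete tokens that a codec can turn back into a waveform, rather than predicting raw audio directly. The sketch below is purely illustrative and is not Microsoft's codec: a uniform 8-bit quantizer stands in for the neural codec's encoder and decoder, just to show what "discrete codes for audio" looks like.

```python
import math

def encode_to_codes(wave, levels=256):
    """Toy stand-in for a neural codec encoder: samples in [-1, 1] -> discrete codes."""
    codes = []
    for x in wave:
        x = max(-1.0, min(1.0, x))           # clip to the codec's input range
        codes.append(round((x + 1.0) / 2.0 * (levels - 1)))
    return codes

def decode_from_codes(codes, levels=256):
    """Toy stand-in for the codec decoder: discrete codes -> samples in [-1, 1]."""
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]

# A 3-second "enrollment prompt" at a toy 100 Hz sample rate (a 2 Hz sine wave).
prompt = [0.5 * math.sin(2 * math.pi * 2.0 * (n / 100.0)) for n in range(300)]

codes = encode_to_codes(prompt)      # the discrete tokens a language model would predict
recon = decode_from_codes(codes)     # waveform recovered from the tokens alone
error = max(abs(r - p) for r, p in zip(recon, prompt))
```

In VALL-E the analogous codes come from a learned neural codec, and an autoregressive language model predicts sequences of them conditioned on the text and the three-second prompt; the codec decoder then renders the predicted codes as audio.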

What can you do with Microsoft VALL-E?

According to the researchers, Microsoft VALL-E is a “neural codec language model” trained on discrete codes “derived from a pre-existing neural audio codec model.” It was trained on 60,000 hours of speech, which is “hundreds of times greater than existing systems,” according to the statement. Speech-mimicking AI has been around for a while, but where earlier attempts sound obviously robotic, these samples are convincing.

Microsoft VALL-E was trained on 60,000 hours of speech

According to the researchers, Microsoft VALL-E can “preserve the speaker’s emotion and auditory environment” of the prompt. Impressive as that is, the technology is still a long way from replacing voice actors, because finding the right tone and emotion during a performance is a different matter. Even an advanced version of Microsoft VALL-E likely wouldn’t perform as well as a skilled professional, yet businesses often prioritize cost-effectiveness over quality.

You can listen to some of the samples on Microsoft’s GitHub demo page.

Microsoft VALL-E features

Although Microsoft VALL-E is very new, it already has many features.

Synthesis diversity: Because Microsoft VALL-E generates its discrete tokens with a sampling-based technique, its output varies for the same input text. It can therefore synthesize diverse personalized speech samples by using different random seeds.

Acoustic environment maintenance: Microsoft VALL-E can produce customized speech while maintaining the acoustic environment of the speaker prompt. Compared with the baseline, VALL-E is trained on a larger dataset with more acoustic variability. The audio and transcriptions in this demonstration come from samples in the Fisher dataset.

Microsoft VALL-E can provide customized speech while maintaining the acoustic environment of the speaker prompt

Speaker’s emotion maintenance: Using audio prompts from the Emotional Voices Database, Microsoft VALL-E can create customized speech while maintaining the emotional tenor of the speaker prompt. Traditional approaches train a model on a supervised emotional TTS dataset, correlating each utterance with a transcription and an emotion label; VALL-E preserves the emotion in the prompt even in a zero-shot setting.

Microsoft VALL-E still has issues with model structure, data coverage, and synthesis robustness.
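The "synthesis diversity" feature described above follows directly from sampling tokens instead of always picking the most likely one. The sketch below is an invented illustration, not VALL-E's learned model: the 1,024-token vocabulary and the uniform distribution are made up, and it shows only why different random seeds give different token sequences for the same text.

```python
import random

VOCAB = 1024  # toy acoustic-token vocabulary size, invented for illustration

def sample_sequence(text, seed, length=50):
    """Sample a toy acoustic-token sequence for `text`.

    A real model draws each token from a learned distribution conditioned on
    the text and the acoustic prompt; drawing uniformly here is enough to show
    how the same input yields different sequences under different seeds.
    """
    rng = random.Random(seed * 1_000_003 + sum(ord(ch) for ch in text))
    return [rng.randrange(VOCAB) for _ in range(length)]

same_a = sample_sequence("hello world", seed=0)   # same text + same seed:
same_b = sample_sequence("hello world", seed=0)   # identical token sequence
varied = sample_sequence("hello world", seed=1)   # new seed: a different rendition
```

Greedy decoding would always emit one fixed sequence per input; sampling is what lets a single text prompt yield several distinct personalized renditions.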

How does Microsoft VALL-E work?

Microsoft used LibriLight, an audio library put together by Meta, to train VALL-E’s voice-synthesis skills. Most of the 60,000 hours of English-language speech come from LibriVox public-domain audiobooks and are spoken by more than 7,000 people. For VALL-E to produce a satisfactory result, the voice in the three-second sample must closely resemble a voice in the training data.

More than 7,000 speakers contributed to the data used to build Microsoft VALL-E

Microsoft offers dozens of audio examples of the AI model in action on the VALL-E example page. The “Speaker Prompt” is the three seconds of audio that VALL-E is instructed to mimic. The “Ground Truth” is a previously recorded excerpt from the same speaker, used as a benchmark (something like the control in an experiment). The “VALL-E” sample is the output from the VALL-E model, and the “Baseline” sample is synthesis produced by a conventional text-to-speech approach.

Microsoft VALL-E made history as the first, but certainly not the last, major AI project of 2023. The technology giant also financially backs OpenAI, whose Point-E was published in the final weeks of 2022.

 

 

Tags: featured, text-to-speech
Emre Çıtak

Emre’s love for animals made him a veterinarian, and his passion for technology made him an editor. Making new discoveries in the field of editorial and journalism, Emre enjoys conveying information to a wide audience, which has always been a dream for him.


© 2021 TechBriefly is a Linkmedya brand.
