Speech synthesis

Speech synthesis in Yandex SpeechKit allows you to convert any text to speech in multiple languages.

SpeechKit voice models are based on deep neural networks. When synthesizing speech, a model captures many features of the original voice. Before starting the synthesis, the model evaluates the entire text rather than individual sentences. This helps the synthesized voice sound clear and natural, without electronic distortion, and reproduce appropriate human intonations.

The service is available at tts.api.cloud.yandex.net:443.

To try out the Text-to-Speech and Speech-to-Text product demos, visit the SpeechKit page on our website.

Synthesis features

To work with SpeechKit, you can access it via the API or Playground. For more information about working with the Yandex Cloud API, see API concepts.

SpeechKit synthesis has two APIs: API v1 and API v3.

Feature	API v1	API v3
Specification	REST	gRPC, REST
Voice selection	`voice`	`hints: voice`
Language selection	Depends on the voice; `lang`	Depends on the voice, not specified explicitly in the request
Role setting	Depends on the voice; `emotion`	Depends on the voice; `hints: role`
Voice pitch management	No	`hints: pitch_shift`
Pronunciation control	SSML, TTS	TTS
Pronunciation rate	`speed`	`hints: speed`
Volume adjustment	No	`loudness_normalization_type`
Output audio format	`format`	`output_audio_spec`
Setting LPCM parameters	`sampleRateHertz`	`output_audio_spec: raw_audio`
Audio pattern-based synthesis	No	`text_template`
Pricing method	Total number of characters in requests	By request
Automatic splitting of long phrases	Not required	`unsafe_mode`

Note

The SpeechKit API v3 may return multiple audio fragments in response to a single request. A complete response is a result of merging all the received fragments.
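Merging the fragments can be sketched as follows. This is a minimal illustration, assuming each fragment arrives as a `bytes` object and that the fragments can simply be appended in the order they were received; the function name `merge_audio_fragments` is hypothetical and not part of SpeechKit.

```python
from typing import Iterable


def merge_audio_fragments(fragments: Iterable[bytes]) -> bytes:
    """Concatenate audio fragments in the order they were received.

    Assumption: every fragment is a chunk of one audio stream, so the
    complete response is the plain concatenation of all chunks.
    """
    return b"".join(fragments)


# Placeholder bytes standing in for fragments received from the service.
received = [b"\x01\x02", b"\x03", b"\x04\x05"]
audio = merge_audio_fragments(received)
```

Keeping the merge step separate from the network code makes it easy to write the result to a file or pass it to a player once the last fragment has arrived.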

Streaming synthesis

With the API v3, SpeechKit allows you to send texts for synthesis one by one within a single session. This is convenient if you do not have the entire text at once but need to generate speech quickly, e.g., when synthesizing LLM responses in real time.
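One way to prepare an LLM response for streaming synthesis is to group incoming tokens into sentence-sized texts before sending them. The sketch below shows only that client-side grouping; it does not call SpeechKit, and the helper name `sentences_from_tokens` and the sentence-ending rule are assumptions made for illustration.

```python
from typing import Iterable, Iterator

# Assumed sentence terminators; adjust for the language of the text.
SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they can be assembled from tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit the buffer once it ends with a sentence terminator.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Emit whatever is left when the token stream ends.
    if buffer.strip():
        yield buffer.strip()


for text in sentences_from_tokens(["Hel", "lo. ", "How are", " you?"]):
    print(text)  # each text would be sent for synthesis in the same session
```

Sending whole sentences rather than single tokens gives the model more context for intonation while still keeping latency low.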

Streaming synthesis does not support SpeechKit Brand Voice Lite voices yet.

Languages and voices

You can select a voice for converting your text to speech. Each voice matches a model trained on the speaker's speech pattern. Voices differ by pitch, gender, and language. For a list of voices and their specifications, see List of voices.

If no voice suits your task, SpeechKit can create a unique one specially for you. For more information, see Yandex SpeechKit Brand Voice.

SpeechKit can synthesize speech in different languages. Each voice is designed to synthesize speech in a specific language. The voices can also read text in another language, but the quality of the synthesized speech will be impaired in this case, as the speaker will pronounce the text with an accent, and there may be errors in word synthesis.

Role

The synthesized speech will sound different depending on the selected role. A role is a manner of pronunciation for the same speaker. Different voices have different sets of roles available. Attempting to use a role that the selected voice does not have will result in a service error.

Voice pitch

Each SpeechKit voice has a certain pitch. In the API v3, you can change the pitch of a voice by specifying a shift from its original pitch. This shift is set in the hints: pitch_shift parameter in the range of [-1000;1000] Hz. The default value is 0. Positive hints: pitch_shift values raise the voice pitch, negative ones lower it.
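The range check can be done on the client before a request is sent. The following is a minimal sketch; the function name `pitch_shift_hint` and the dictionary shape of the returned hint are assumptions for illustration, and only the parameter name, range, and default come from the description above.

```python
def pitch_shift_hint(shift_hz: float = 0.0) -> dict:
    """Build a pitch_shift hint, rejecting values outside [-1000;1000] Hz."""
    if not -1000 <= shift_hz <= 1000:
        raise ValueError(f"pitch_shift must be within [-1000;1000] Hz, got {shift_hz}")
    # Assumed shape: one hint expressed as a single-key dictionary.
    return {"pitch_shift": shift_hz}


higher = pitch_shift_hint(200)   # positive value raises the pitch
lower = pitch_shift_hint(-200)   # negative value lowers the pitch
```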

Pronunciation control

To control pronunciation in synthesized speech, explicitly mark up the source text. SpeechKit can synthesize speech from text marked up using the Speech Synthesis Markup Language (SSML) or TTS markup. These markups allow you to set the length of pauses, the pronunciation of individual sounds, and more. The SSML and TTS markups are transmitted using different parameters:

SSML is only supported in API v1 requests. To transmit text in SSML format, include the ssml parameter in the request body and wrap the text using the <speak> tag. For more information about SSML tags, see SSML markup.
TTS is supported in the API v1 and API v3. In API v1 requests, transmit the text marked up according to the TTS rules in the text parameter in the request body. Neither the API v3 nor the Python SDK requires special parameters: both treat any transmitted text as marked up according to the TTS rules. For more information about using the TTS markup, see TTS markup.
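For API v1, the difference between the two markups comes down to which request body parameter carries the text. The sketch below builds only that part of the body; the function name `build_v1_text_fields` and its `markup` argument are hypothetical, and sending the request is left out.

```python
def build_v1_text_fields(content: str, markup: str = "tts") -> dict:
    """Return the API v1 body field that carries the text to synthesize."""
    if markup == "ssml":
        stripped = content.strip()
        # SSML text must be wrapped in the <speak> tag.
        if not (stripped.startswith("<speak>") and stripped.endswith("</speak>")):
            stripped = f"<speak>{stripped}</speak>"
        return {"ssml": stripped}
    if markup == "tts":
        # TTS markup (or plain text) goes into the text parameter as-is.
        return {"text": content}
    raise ValueError(f"unknown markup: {markup!r}")


ssml_fields = build_v1_text_fields("Hello, world!", markup="ssml")
tts_fields = build_v1_text_fields("Hello, world!")
```

The remaining body parameters, such as `voice` or `format`, would be added to the same dictionary before the request is sent.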

Warning

When using pattern-based synthesis, the markup outside the variable part is ignored.

Synthesis settings

You can configure both the pronunciation and the technical characteristics of the synthesized speech.

Synthesized speech rate

The rate of synthesized speech affects how well the information is understood. If the speech is too fast or too slow, it sounds unnatural. However, a faster rate can be useful in commercials, where every second of air time counts.

By default, generated speech has an average human speech rate.

Volume normalization

In API v3 and Python SDK requests, you can set the type and level of volume normalization. You may need this when using SpeechKit synthesis along with other sound sources, for example, to match the volume of a voice assistant to that of phone notifications.

SpeechKit supports two normalization types:

MAX_PEAK: peak normalization, where the audio level is raised to the maximum value that digital audio can reach without distortion.
LUFS: weighted normalization based on the EBU R 128 standard, where the volume is normalized relative to the full digital scale.

You can set the normalization type in the loudness_normalization_type parameter. By default, SpeechKit uses LUFS.

The normalization level is set in the hints: volume parameter. The possible values depend on the normalization type:

For MAX_PEAK, you can set it in the (0;1] range; the default value is 0.7.
For LUFS, you can set it in the [-149;0) range; the default value is -19.

If the normalization level falls outside the supported range, the SpeechKit server will return the InvalidArgument error.
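The ranges and defaults above can be checked on the client before sending a request, which avoids a round trip that would end in InvalidArgument. This is a minimal sketch; the function name `resolve_volume` and the use of `ValueError` are assumptions, while the ranges and default values are the ones listed above.

```python
# Default normalization levels for each normalization type.
DEFAULT_VOLUME = {"MAX_PEAK": 0.7, "LUFS": -19.0}


def resolve_volume(norm_type: str = "LUFS", volume=None) -> float:
    """Return a valid normalization level for the given normalization type."""
    if norm_type not in DEFAULT_VOLUME:
        raise ValueError(f"unknown normalization type: {norm_type!r}")
    if volume is None:
        return DEFAULT_VOLUME[norm_type]
    if norm_type == "MAX_PEAK":
        in_range = 0 < volume <= 1      # (0;1]
    else:
        in_range = -149 <= volume < 0   # [-149;0)
    if not in_range:
        raise ValueError(f"volume {volume} is out of range for {norm_type}")
    return volume


level = resolve_volume("MAX_PEAK", 0.9)
```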

Synthesized audio file format

In SpeechKit, you can select the audio file format for speech synthesis.

For the full list of available formats and their specifications, see Supported audio formats.
