Speech synthesis
Speech synthesis in Yandex SpeechKit allows you to convert any text to speech in multiple languages.
SpeechKit voice models use deep neural network technology. When synthesizing speech, a model captures many features of the original voice. Before starting the synthesis, the model evaluates the entire text rather than individual sentences. This helps the synthesized voice sound clear and natural, without electronic distortion, and reproduce appropriate human intonations.
The service is available at tts.api.cloud.yandex.net:443.
To try out the Text-to-Speech and Speech-to-Text product demos, visit the SpeechKit page on our website.
Synthesis features
To work with SpeechKit, you can access it via the API or Playground. For more information about working with the Yandex Cloud API, see API concepts.
SpeechKit synthesis has two APIs: API v1 and API v3.
| | API v1 | API v3 |
|---|---|---|
| Specification | REST | gRPC, REST |
| Voice selection | voice | hints: voice |
| Language selection | Depends on the voice; lang | Depends on the voice, not specified explicitly in the request |
| Role setting | Depends on the voice; emotion | Depends on the voice; hints: role |
| Voice pitch management | No | hints: pitch_shift |
| Pronunciation control | SSML, TTS | TTS |
| Pronunciation rate | speed | hints: speed |
| Volume adjustment | No | loudness_normalization_type |
| Output audio format | format | output_audio_spec |
| Setting LPCM parameters | sampleRateHertz | output_audio_spec: raw_audio |
| Audio pattern-based synthesis | No | text_template |
| Pricing method | Total number of characters in requests | By request |
| Automatic splitting of long phrases | Not required | unsafe_mode |
Note
The SpeechKit API v3 may return multiple audio fragments in response to a single request. A complete response is a result of merging all the received fragments.
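The merging step described in the note can be sketched as follows. This is a minimal illustration, not SpeechKit client code: it assumes each fragment arrives as a bytes object and that the fragments are consecutive pieces of one audio stream, so appending them in arrival order yields the complete audio. The function name merge_audio_fragments is hypothetical.

```python
import io
from typing import Iterable


def merge_audio_fragments(fragments: Iterable[bytes]) -> bytes:
    """Append audio fragments in arrival order and return the combined bytes.

    Assumption: fragments are consecutive slices of a single audio stream.
    """
    buffer = io.BytesIO()
    for fragment in fragments:
        # Empty fragments are harmless; writing b"" appends nothing.
        buffer.write(fragment)
    return buffer.getvalue()
```

In a real client, the iterable would be fed from the fragments received in the API v3 response rather than from an in-memory list.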
Streaming synthesis
SpeechKit allows you to send texts for synthesis one by one within a single session using the API v3. This is convenient if you do not have the entire text at once but need to generate the speech fast, e.g., when synthesizing real-time LLM responses.
Streaming synthesis does not support SpeechKit Brand Voice Lite voices yet.
Languages and voices
You can select a voice for converting your text to speech. Each voice matches a model trained on the speaker's speech pattern. Voices differ by pitch, gender, and language. For a list of voices and their specifications, see List of voices.
If no voice suits your task, SpeechKit can create a unique one specially for you. For more information, see Yandex SpeechKit Brand Voice.
SpeechKit can synthesize speech in different languages. Each voice is designed to synthesize speech in a specific language. The voices can also read text in another language, but the quality of the synthesized speech will be impaired in this case, as the speaker will pronounce the text with an accent, and there may be errors in word synthesis.
Role
The synthesized speech will sound different depending on the selected role. A role is a manner of pronunciation for the same speaker. Different sets of roles are available for different voices. Attempting to use a role the selected voice does not have will result in a service error.
Voice pitch
Each SpeechKit voice has a certain pitch. In the API v3, you can change a voice by specifying a shift from its original pitch. This shift is set in the hints: pitch_shift parameter in the range of [-1000;1000] Hz. The default value is 0. Positive hints: pitch_shift values raise the voice pitch, negative ones lower it.
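Because the allowed range is fixed, a client can check the shift before sending a request. The sketch below only encodes the range and default stated above; the function name check_pitch_shift is hypothetical, and how the value is then placed into the request hints is left out because it depends on the client library used.

```python
PITCH_SHIFT_MIN_HZ = -1000.0  # lower bound of the documented range
PITCH_SHIFT_MAX_HZ = 1000.0   # upper bound of the documented range


def check_pitch_shift(pitch_shift: float = 0.0) -> float:
    """Return pitch_shift unchanged if it lies in [-1000;1000] Hz, else raise."""
    if not PITCH_SHIFT_MIN_HZ <= pitch_shift <= PITCH_SHIFT_MAX_HZ:
        raise ValueError(
            f"pitch_shift must be within [-1000;1000] Hz, got {pitch_shift}"
        )
    return pitch_shift
```

A positive value such as 200 raises the pitch, a negative value such as -200 lowers it, and omitting the argument keeps the voice at its original pitch.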
Pronunciation control
To control pronunciation in synthesized speech, explicitly mark up the source text. SpeechKit can synthesize speech from a text marked up using the Speech Synthesis Markup Language (SSML) or TTS markup. These markups allow setting the length of pauses, the pronunciation of individual sounds, and much more. The SSML and TTS markups have different data transmission parameters:
- SSML is only supported in API v1 requests. To transmit text in SSML format, include the ssml parameter in the request body and wrap the text using the <speak> tag. For more information about SSML tags, see SSML markup.
- TTS is supported in the API v1 and API v3. In API v1 requests, transmit the text marked up according to the TTS rules in the text parameter in the request body. Neither the API v3 nor the Python SDK requires special parameters; both treat any transmitted text as marked up according to the TTS rules. For more information about using the TTS markup, see TTS markup.
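The difference between the two API v1 parameters can be sketched as a small helper that prepares the request body. This is an illustration under assumptions: the function name build_v1_body is hypothetical, only the ssml, text, and voice parameters named on this page are used, and sending the request (authorization, endpoint path) is intentionally left out.

```python
from typing import Dict, Optional


def build_v1_body(
    content: str, *, use_ssml: bool = False, voice: Optional[str] = None
) -> Dict[str, str]:
    """Build API v1 request body fields for TTS-marked or SSML text."""
    if use_ssml:
        stripped = content.strip()
        # SSML must be wrapped in <speak>; add the wrapper only if it is missing.
        if not (stripped.startswith("<speak") and stripped.endswith("</speak>")):
            stripped = f"<speak>{stripped}</speak>"
        body = {"ssml": stripped}
    else:
        # Plain text or TTS markup goes into the text parameter as-is.
        body = {"text": content}
    if voice is not None:
        body["voice"] = voice
    return body
```

For example, build_v1_body("Hello", use_ssml=True) yields a body with the ssml field set to the wrapped text, while build_v1_body("Hello") yields a body with only the text field.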
Warning
When using pattern-based synthesis, the markup outside the variable part is ignored.
Synthesis settings
You can configure both pronunciation and specifications of the speech to synthesize.
Synthesized speech rate
The rate of synthesized speech affects how well listeners comprehend the information. If the speech is too fast or too slow, it sounds unnatural. However, a non-standard rate can be useful, for example, in commercials where every second of air time counts.
By default, generated speech has an average human speech rate.
Volume normalization
In API v3 and Python SDK requests, you can set the type and level of volume normalization. You may need this when using SpeechKit synthesis along with other sound sources, for example, to match the volume of a voice assistant to that of phone notifications.
SpeechKit supports two normalization types:
- MAX_PEAK: peak normalization, where the audio level rises to the maximum possible distortion-free value attainable for digital audio.
- LUFS: weighted normalization based on the EBU R 128 standard, where the volume is normalized relative to the full digital scale.
You can set the normalization type in the loudness_normalization_type parameter. By default, SpeechKit uses LUFS.
The normalization level is set in the hints: volume parameter. The possible values depend on the normalization type:
- For MAX_PEAK, you can set it in the (0;1] range; the default value is 0.7.
- For LUFS, you can set it in the [-149;0) range; the default value is -19.
If the normalization level falls outside the supported range, the SpeechKit server will return the InvalidArgument error.
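To avoid a round trip that ends in InvalidArgument, a client can check the level locally first. The sketch below encodes only the ranges and defaults listed above, including which bounds are open and which are closed. The function name resolve_volume and the table layout are hypothetical and not part of the SpeechKit SDK.

```python
from typing import Optional

# Per type: (low, high, low_inclusive, high_inclusive, default)
VOLUME_RULES = {
    "MAX_PEAK": (0.0, 1.0, False, True, 0.7),    # (0;1]
    "LUFS": (-149.0, 0.0, True, False, -19.0),   # [-149;0)
}


def resolve_volume(
    normalization_type: str = "LUFS", volume: Optional[float] = None
) -> float:
    """Return the volume to send, falling back to the documented default."""
    if normalization_type not in VOLUME_RULES:
        raise ValueError(f"unknown normalization type: {normalization_type!r}")
    low, high, low_inc, high_inc, default = VOLUME_RULES[normalization_type]
    if volume is None:
        return default
    above_low = volume >= low if low_inc else volume > low
    below_high = volume <= high if high_inc else volume < high
    if not (above_low and below_high):
        raise ValueError(
            f"volume {volume} is outside the range for {normalization_type}"
        )
    return volume
```

Note that the same number can be valid for one type and invalid for the other: 0.5 is accepted for MAX_PEAK but rejected for LUFS, so the level should always be validated together with the normalization type.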
Synthesized audio file format
In SpeechKit, you can select the audio file format for speech synthesis.
For the full list of available formats and their specifications, see Supported audio formats.
Use cases
- Speech synthesis in the API v3
- Speech synthesis in the REST API v3
- Speech synthesis in OggOpus format using the API v1
- Speech synthesis in WAV format using the API v1
- Speech synthesis from SSML text using API v1
- Streaming synthesis