Yandex SpeechKit technology overview

With Yandex SpeechKit voice technologies, you can solve a wide range of tasks that involve human speech. SpeechKit can recognize speech either in real time or from pre-recorded audio files, automatically detecting the speaker's language. It can also convert template phrases and long texts to speech using the standard SpeechKit voices.

SpeechKit is available through APIs. Depending on the task, you can use either the gRPC or the REST interface. For more on APIs in Yandex Cloud, see Yandex Cloud API concepts.
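
As a rough illustration of the REST interface, the sketch below builds, without sending, a request that submits a short audio clip for recognition. The endpoint URL, the `lang` query parameter, and the `Api-Key` authorization scheme are assumptions here and should be checked against the current SpeechKit API reference. The helper name `build_recognition_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of the REST recognition API; verify it against
# the current SpeechKit API reference before use.
STT_URL = "https://stt.api.cloud.yandex.net/speech/v1/stt:recognize"


def build_recognition_request(audio: bytes, api_key: str, lang: str = "en-US") -> Request:
    """Build (but do not send) a POST request carrying a short audio clip.

    Hypothetical helper: the parameter names and the auth scheme are assumptions.
    """
    query = urlencode({"lang": lang})
    return Request(
        url=f"{STT_URL}?{query}",
        data=audio,  # raw audio bytes go in the request body
        headers={"Authorization": f"Api-Key {api_key}"},
        method="POST",
    )


# Placeholder bytes and key; a real call needs actual audio and credentials.
request = build_recognition_request(b"\x00\x01", api_key="<your-api-key>")
print(request.get_method(), request.full_url)
```

To actually send the request, you would pass `request` to `urllib.request.urlopen` and parse the response body. The gRPC interface would be the likelier choice for streaming scenarios.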

The table below lists the most common SpeechKit use cases to help you choose the right technology and tailor it to your tasks.

Use case and description	Recommended technology	Features and settings
Voice robot
Full or partial automation of phone conversations with customers.	For a user request: Streaming recognition. For a system response: Speech synthesis using standard voices and a tailor-made Brand Voice.	Both intermediate and final recognition results. Controlling pronunciation by marking up the text to synthesize. Pattern-based speech synthesis.
Speech analytics, quality control of agent performance
Transcribing and further analysis of audio recordings of conversations between customers and call center agents or robots.	To recognize pre-recorded audio files: Asynchronous recognition of audio files.	Word start and end timestamps in the recognition results. Recognition result normalization. Deferred mode for asynchronous recognition of audio files. Quotas and limits in SpeechKit.
Voice control in apps and smart devices, voice assistant
The user requests an action or a search by voice, and the service responds with an action accompanied by a voice comment or an image.	For a user request: Streaming recognition. For a system response: Speech synthesis using standard voices and a Brand Voice.	Both intermediate and final recognition results. Controlling pronunciation by marking up the text to synthesize. Recognition result normalization.
Adjustments for visually impaired users
Voice control, voice guidance and comments for visually impaired users.	For a user request: Streaming recognition. For a system response: Speech synthesis using standard voices and a Brand Voice.	Both intermediate and final recognition results. Controlling pronunciation by marking up the text to synthesize.
Recognizing audio recordings made at meetings
Transcribing the audio recordings after the meeting has ended.	To recognize pre-recorded audio files: Asynchronous recognition of audio files.	Deferred mode for asynchronous recognition of audio files. Quotas and limits in SpeechKit. Word start and end timestamps in the recognition results. Recognition result normalization.
Voicing books and videos
Voicing a book or video with no human speaker involved.	Speech synthesis using standard voices and a Brand Voice.	Controlling pronunciation by marking up the text to synthesize. Quotas and limits in SpeechKit.
Recording the minutes of a meeting
Transcribing the meeting in real time to produce the minutes.	To recognize the participants' speech: Streaming recognition.	Both intermediate and final recognition results. Recognition result normalization.
Video subtitles
Creating subtitles for recorded videos.	To recognize an audio track: Asynchronous recognition of audio files.	Deferred mode for asynchronous recognition of audio files. Word start and end timestamps in the recognition results. Recognition result normalization. Quotas and limits in SpeechKit.
Broadcast subtitles
Transcribing broadcasts in real time.	To recognize the broadcast speech: Streaming recognition.	Both intermediate and final recognition results. Recognition result normalization.
Transcribing voice messages
Converting short voice messages to text in messengers.	To recognize audio files: Synchronous recognition.	Recognition result settings.
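
The recommendations in the table above follow a simple pattern: streaming recognition for real-time speech, asynchronous recognition for pre-recorded files, synchronous recognition for short voice messages, and speech synthesis for system responses and voicing. The sketch below encodes that pattern as a small lookup helper. The names `Technology` and `recommend` are hypothetical and are not part of any SpeechKit SDK.

```python
from enum import Enum


class Technology(Enum):
    """Technologies named in the use case table (labels copied from the table)."""
    STREAMING = "Streaming recognition"
    ASYNCHRONOUS = "Asynchronous recognition of audio files"
    SYNCHRONOUS = "Synchronous recognition"
    SYNTHESIS = "Speech synthesis"


def recommend(direction: str, real_time: bool = False, short_audio: bool = False) -> Technology:
    """Pick a technology for a task, mirroring the table's recommendations.

    direction: "text_to_speech" or "speech_to_text" (hypothetical labels).
    real_time: the speech must be recognized while it is being spoken.
    short_audio: the input is a short clip, such as a voice message.
    """
    if direction == "text_to_speech":
        return Technology.SYNTHESIS
    if direction != "speech_to_text":
        raise ValueError(f"unknown direction: {direction!r}")
    if real_time:
        return Technology.STREAMING
    if short_audio:
        return Technology.SYNCHRONOUS
    return Technology.ASYNCHRONOUS


# Broadcast subtitles: speech is recognized in real time.
print(recommend("speech_to_text", real_time=True).value)
# Video subtitles: the audio track is pre-recorded.
print(recommend("speech_to_text").value)
```

A voice robot combines two calls: one for the user request (`speech_to_text` with `real_time=True`) and one for the system response (`text_to_speech`).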
