Speech recognition
Speech recognition is speech-to-text (STT) conversion.
You can access SpeechKit via the API or Playground. For more information about working with the Yandex Cloud API, see API concepts.
The service is available at stt.api.cloud.yandex.net:443.
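As a rough illustration of addressing this endpoint, the sketch below builds (but does not send) an HTTPS request for synchronous recognition using only the Python standard library. The URL path `/speech/v1/stt:recognize`, the `folderId` and `lang` query parameters, and the `Authorization: Bearer` header are assumptions about the REST v1 API that this page does not confirm, so check them against the API reference. `build_sync_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

STT_HOST = "stt.api.cloud.yandex.net"  # host from this page; 443 is the default HTTPS port


def build_sync_request(audio: bytes, iam_token: str, folder_id: str,
                       lang: str = "en-US") -> Request:
    """Build (without sending) a POST request that carries raw audio bytes.

    Assumed, not confirmed by this page: the URL path, the query
    parameter names, and the Bearer authorization scheme.
    """
    query = urlencode({"folderId": folder_id, "lang": lang})
    url = f"https://{STT_HOST}/speech/v1/stt:recognize?{query}"
    return Request(
        url,
        data=audio,
        headers={"Authorization": f"Bearer {iam_token}"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_sync_request(b"\x00\x01", iam_token="<IAM token>", folder_id="<folder ID>")
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it; that step is left out here because it needs valid credentials.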
Recognition methods
SpeechKit provides two speech recognition methods:
- Streaming recognition handles speech in real time. SpeechKit receives short audio fragments and returns the results, including intermediate ones, within a single connection.
- Audio file recognition. SpeechKit can recognize audio recordings in synchronous and asynchronous modes.
  - Synchronous mode has strict limitations on file size and duration and is suitable for single-channel audio recordings of up to 30 seconds.
  - Asynchronous mode can process multi-channel audio recordings. Maximum recording duration: 4 hours.
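Streaming recognition depends on the client cutting audio into short fragments. Here is a minimal sketch of that client-side step, assuming the raw audio bytes are already in memory. The 4,000-byte fragment size and the `iter_audio_chunks` name are illustrative assumptions, not values SpeechKit prescribes.

```python
from typing import Iterator


def iter_audio_chunks(audio: bytes, chunk_size: int = 4000) -> Iterator[bytes]:
    """Yield consecutive fragments of `audio`, each at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


if __name__ == "__main__":
    fake_audio = bytes(10_000)  # 10,000 zero bytes standing in for real audio
    sizes = [len(chunk) for chunk in iter_audio_chunks(fake_audio)]
    print(sizes)  # [4000, 4000, 2000]
```

Each fragment would then be sent as a separate message within the same connection, while results are read back as they arrive.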
Which recognition to choose
| | Streaming recognition | Synchronous recognition | Asynchronous recognition |
|---|---|---|---|
| Use cases | Phone assistants and robots; virtual assistants | Virtual assistants; voice control; speech recognition of short voice messages in messengers | Transcribing audio calls and presentations; subtitling; call center script compliance monitoring; identifying successful scripts; evaluating performance of call center agents |
| Input data | Real-time voice | Pre-recorded short single-channel audio files | Pre-recorded multi-channel and long audio files |
| How it works | Exchanging messages with the server within a single connection | Request followed by a quick response | Request followed by a delayed response |
| Supported APIs | gRPC v2, gRPC v3 | REST v1 | REST v2, REST v3, gRPC v3 |
| Maximum duration of audio data | 5 minutes | 30 seconds | 4 hours |
| Maximum amount of transmitted data | 10 MB | 1 MB | 1 GB |
| Number of recognition channels | 1 | 1 | 2 |
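The limits in the table can be turned into a small selection helper. The sketch below copies the duration, size, and channel limits from the table. The `choose_mode` and `fits` names, the `real_time` flag, and the reading of 1 MB as 2**20 bytes (and 1 GB as 1024 MB) are assumptions made for illustration.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the table does not say whether MB means 10**6 or 2**20 bytes


@dataclass(frozen=True)
class Limits:
    max_seconds: float
    max_bytes: int
    max_channels: int


# Values copied from the comparison table.
LIMITS = {
    "streaming": Limits(5 * 60, 10 * MB, 1),
    "synchronous": Limits(30, 1 * MB, 1),
    "asynchronous": Limits(4 * 60 * 60, 1024 * MB, 2),
}


def fits(mode: str, seconds: float, size_bytes: int, channels: int) -> bool:
    """Check the audio against every limit of the given mode."""
    limit = LIMITS[mode]
    return (seconds <= limit.max_seconds
            and size_bytes <= limit.max_bytes
            and channels <= limit.max_channels)


def choose_mode(seconds: float, size_bytes: int, channels: int = 1,
                real_time: bool = False) -> str:
    """Return the first applicable mode whose limits fit the audio."""
    candidates = ["streaming"] if real_time else ["synchronous", "asynchronous"]
    for mode in candidates:
        if fits(mode, seconds, size_bytes, channels):
            return mode
    raise ValueError("audio exceeds the limits of every applicable mode")


if __name__ == "__main__":
    print(choose_mode(12, 200_000))                  # synchronous
    print(choose_mode(3600, 50 * MB, channels=2))    # asynchronous
    print(choose_mode(90, 2 * MB, real_time=True))   # streaming
```

Pre-recorded audio is tried against synchronous mode first because it returns a result sooner; anything that does not fit falls through to asynchronous mode.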
Recognition process
Audio is recognized in three stages:
- The acoustic model determines which set of low-level attributes matches the audio signal.
- The language model uses the acoustic model output to generate text word-by-word.
- The service post-processes the text: adds punctuation marks, converts spelled-out numbers into digits, and more.
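The three stages can be pictured as a chain of functions. The toy sketch below only illustrates the data flow: real acoustic and language models are statistical, whereas here each stage is a hypothetical lookup table, and none of the names or values come from SpeechKit.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup tables standing in for trained models.
FRAME_TO_ATTRIBUTE: Dict[int, str] = {0: "t", 1: "uw", 2: "k", 3: "ae", 4: "s"}
LEXICON: Dict[Tuple[str, ...], str] = {
    ("t", "uw"): "two",
    ("k", "ae", "t", "s"): "cats",
}
NUMBER_WORDS: Dict[str, str] = {"one": "1", "two": "2", "three": "3"}


def acoustic_model(frames: List[int]) -> List[str]:
    """Stage 1: map each audio frame to a low-level attribute."""
    return [FRAME_TO_ATTRIBUTE[frame] for frame in frames]


def language_model(attributes: List[str]) -> List[str]:
    """Stage 2: build the text word by word (greedy longest match here)."""
    words: List[str] = []
    position = 0
    while position < len(attributes):
        for length in range(len(attributes) - position, 0, -1):
            word = LEXICON.get(tuple(attributes[position:position + length]))
            if word is not None:
                words.append(word)
                position += length
                break
        else:
            raise ValueError(f"no word matches attributes from position {position}")
    return words


def postprocess(words: List[str]) -> str:
    """Stage 3: replace number words with digits and add final punctuation."""
    return " ".join(NUMBER_WORDS.get(word, word) for word in words) + "."


def recognize(frames: List[int]) -> str:
    """Run the three stages in order."""
    return postprocess(language_model(acoustic_model(frames)))


if __name__ == "__main__":
    print(recognize([0, 1, 2, 3, 0, 4]))  # 2 cats.
```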
Recognition accuracy
Recognition accuracy depends on the recognition model. You can improve a model's accuracy by providing data to fine-tune it. For more information about model fine-tuning, see Extending a speech recognition model.
The accuracy of speech recognition is also affected by:
- Original sound quality.
- Audio encoding quality.
- Speech intelligibility and rate.
- Utterance complexity and length.
Use cases
- Audio file streaming recognition using the API v3
- Streaming speech recognition with auto language detection in the API v3
- Asynchronous WAV audio file recognition using the API v3
- Example of using the API v1 for synchronous recognition