Asynchronous recognition API v2

To use the API v2, you will need:

A Yandex Object Storage bucket to which you will upload your audio file for recognition.
A service account with the ai.speechkit-stt.user and storage.uploader roles for accessing SpeechKit and Object Storage.
An IAM token or API key for authentication.
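
As a minimal sketch of how either credential from the list above is usually passed: Yandex Cloud APIs take an IAM token as `Authorization: Bearer <token>` and an API key as `Authorization: Api-Key <key>`. The helper name `auth_header` is hypothetical and not part of any SDK.

```python
def auth_header(*, iam_token=None, api_key=None):
    """Build the Authorization header from exactly one credential.

    Hypothetical helper for illustration; pass either an IAM token
    or an API key, never both.
    """
    if (iam_token is None) == (api_key is None):
        raise ValueError("pass exactly one of iam_token or api_key")
    if iam_token is not None:
        return {"Authorization": f"Bearer {iam_token}"}
    return {"Authorization": f"Api-Key {api_key}"}
```

Keeping the choice of credential in one place means the rest of the client code does not need to know which one is in use.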

For more information on getting started, see How to asynchronously recognize pre-recorded audio.

Warning

You can recognize audio files asynchronously only as a service account. Do not use any other Yandex Cloud accounts for this purpose.

The asynchronous recognition service for the API v2 is located at transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize

Sending a file for recognition

Request body parameters

The request body structure is as follows:

{
  "config": {
    "specification": {
      "languageCode": "string",
      "model": "string",
      "profanityFilter": boolean,
      "literature_text": boolean,
      "audioEncoding": "string",
      "sampleRateHertz": integer,
      "audioChannelCount": integer,
      "rawResults": boolean
    }
  },
  "audio": {
    "uri": "string"
  }
}

Parameter	Type	Description
config	object	Field with recognition settings.
config.specification	object	Recognition settings.
config.specification.languageCode	string	Language of the audio file for speech recognition. The default value is `ru-RU` (Russian).
config.specification.model	string	Language model for speech recognition. The default value is `general`. Different models have different pricing.
config.specification.profanityFilter	boolean	Profanity filter. Acceptable values: `true` (mask profanities with asterisks in recognition results), `false` (default; do not mask profanities).
config.specification.literature_text	boolean	Enables normalization mode.
config.specification.audioEncoding	string	Audio format. Acceptable values: `LINEAR16_PCM` (LPCM without a WAV header), `OGG_OPUS` (default; Ogg with the Opus codec), `MP3` (MP3).
config.specification.sampleRateHertz	integer (int64)	Audio sampling rate. This parameter is required if `audioEncoding` is `LINEAR16_PCM`. Valid values: `48000` (default; 48 kHz), `16000` (16 kHz), `8000` (8 kHz).
config.specification.audioChannelCount	integer (int64)	Number of channels for LPCM audio files. The default value is `1`. Do not use this field for OggOpus or MP3 audio files: they already contain information about the channel count.
config.specification.rawResults	boolean	Flag that toggles spelling out numbers. Acceptable values: `true` (spell out), `false` (default; use figures).
audio.uri	string	URI of the audio file for recognition. Supports only links to files stored in Yandex Object Storage.
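
The request body described above can be assembled as in the following sketch. It only builds the request object and does not send it. The `https://` scheme, the `POST` method, the `Api-Key` header format, and the function name `build_recognition_request` are assumptions not stated in this section.

```python
import json
import urllib.request

# Endpoint from this page; the https scheme is an assumption.
STT_URL = "https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize"


def build_recognition_request(audio_uri, api_key, *, language_code="ru-RU",
                              audio_encoding="OGG_OPUS",
                              sample_rate_hertz=None, audio_channel_count=None):
    """Build (but do not send) a recognition request. Hypothetical helper."""
    spec = {"languageCode": language_code, "audioEncoding": audio_encoding}
    is_lpcm = audio_encoding == "LINEAR16_PCM"
    if is_lpcm and sample_rate_hertz is None:
        # The parameter table marks sampleRateHertz as required for LPCM.
        raise ValueError("sampleRateHertz is required for LINEAR16_PCM")
    if sample_rate_hertz is not None:
        spec["sampleRateHertz"] = sample_rate_hertz
    if audio_channel_count is not None:
        if not is_lpcm:
            # OggOpus and MP3 files already carry their channel count.
            raise ValueError("audioChannelCount applies to LINEAR16_PCM only")
        spec["audioChannelCount"] = audio_channel_count
    body = {"config": {"specification": spec}, "audio": {"uri": audio_uri}}
    return urllib.request.Request(
        STT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Api-Key {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Fields left at their defaults are omitted from the body so that the service applies its own default values.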

Response

If your request is written correctly, the service returns the Operation object with the recognition operation ID (id):

{
  "done": false,
  "id": "e03sup6d5h1q********",
  "createdAt": "2019-04-21T22:49:29Z",
  "createdBy": "ajes08feato8********",
  "modifiedAt": "2019-04-21T22:49:29Z"
}

Use this ID at the next step.

Getting recognition results

To check the operation status and get the recognition results, send a request to operation.api.cloud.yandex.net.

Use the obtained ID to check for recognition results. The number of status requests is limited, so time them by the expected processing speed: recognizing 1 minute of single-channel audio takes about 10 seconds.
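
The timing guidance above suggests waiting for the estimated processing time before the first status check instead of polling immediately. A minimal sketch of that idea follows. The names `estimated_wait_seconds` and `poll_until_done`, the 10-second retry interval, and the `max_polls` cap are assumptions. `fetch_operation` stands for any callable that returns the Operation object as a dict.

```python
import time

# From the text: about 10 seconds per minute of single-channel audio.
SECONDS_PER_AUDIO_MINUTE = 10


def estimated_wait_seconds(audio_seconds):
    """Rough processing-time estimate for single-channel audio."""
    return audio_seconds / 60 * SECONDS_PER_AUDIO_MINUTE


def poll_until_done(fetch_operation, audio_seconds, *, retry_interval=10.0,
                    max_polls=30, sleep=time.sleep):
    """Wait for the estimate, then check at a fixed interval until done."""
    delay = estimated_wait_seconds(audio_seconds)
    for _ in range(max_polls):
        sleep(delay)
        operation = fetch_operation()
        if operation.get("done"):
            return operation
        delay = retry_interval  # later checks use the fixed interval
    raise TimeoutError("recognition did not finish within the polling budget")
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.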

Warning

Recognition results are stored on the server for 3 days. During this period, you can request the results using the obtained ID.

Path parameters

Parameter	Description
operationId	Operation ID received when sending the recognition request
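
A status request for a given operation might be built as sketched below. This section names only the host and the `operationId` path parameter, so the `/operations/{operationId}` path, the `https://` scheme, the `GET` method, and the `Api-Key` header format are assumptions, as is the function name `build_status_request`.

```python
import urllib.request
from urllib.parse import quote

# Assumed URL layout: host from this page plus an /operations/ path.
OPERATIONS_URL = "https://operation.api.cloud.yandex.net/operations/"


def build_status_request(operation, api_key):
    """Build (but do not send) a status request for an Operation dict."""
    operation_id = operation["id"]  # ID returned when the file was submitted
    return urllib.request.Request(
        OPERATIONS_URL + quote(operation_id, safe=""),
        headers={"Authorization": f"Api-Key {api_key}"},
        method="GET",
    )
```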

Response

The Operation object is returned in response to your request. Response example:

{
  "done": true,
  "response": {
    "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
    "chunks": [
      {
        "alternatives": [
          {
            "words": [
              {
                "startTime": "0.879999999s",
                "endTime": "1.159999992s",
                "word": "when",
                "confidence": 1
              },
              {
                "startTime": "1.219999995s",
                "endTime": "1.539999988s",
                "word": "writing",
                "confidence": 1
              },
              ...
            ],
            "text": "when writing The Hobbit, Tolkien referred to the Norse mythology of the Old English poem Beowulf",
            "confidence": 1
          }
        ],
        "channelTag": "1"
      },
      ...
    ]
  },
  "id": "e03sup6d5h1q********",
  "createdAt": "2019-04-21T22:49:29Z",
  "createdBy": "ajes08feato8********",
  "modifiedAt": "2019-04-21T22:49:36Z"
}

Parameter	Type	Description
done	boolean	Contains `true` when the recognition is complete.
response	object	Asynchronous speech recognition results.
response.@type	string	Response type.
response.chunks	array	Array with recognition results. If speech recognition in the transmitted file fails, the response may not contain an array with the results.
response.chunks.alternatives	array	Array with recognized text alternatives.
response.chunks.alternatives.words	array	Array with recognized words and their details.
response.chunks.alternatives.words.startTime	string	Word start time in the recording. An error of 1-2 seconds is possible.
response.chunks.alternatives.words.endTime	string	Word end time in the recording. An error of 1-2 seconds is possible.
response.chunks.alternatives.words.word	string	Recognized word. Recognized numbers are spelled out (e.g., `twelve` instead of `12`).
response.chunks.alternatives.words.confidence	integer (int64)	This field is not supported. Do not use it.
response.chunks.alternatives.text	string	Entire recognized text. By default, numbers are written in figures. To output the entire text in word form, set the `config.specification.rawResults` parameter to `true`.
response.chunks.alternatives.confidence	integer (int64)	This field is not supported. Do not use it.
response.chunks.channelTag	string	Audio channel for which recognition was performed.
id	string	Operation ID. Generated on the service side.
createdAt	google.protobuf.Timestamp	Operation start time. Uses RFC3339 (Timestamps) format.
createdBy	string	ID of the user who started the operation.
modifiedAt	google.protobuf.Timestamp	Resource last update time. Uses RFC3339 (Timestamps) format.

For more information about the response format and codes, see Response status codes.
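
As an illustration of reading the response fields described above, the sketch below joins the recognized text per channel and converts word offsets such as `"0.879999999s"` to seconds. Taking only the first alternative of each chunk is an assumption, and the function names are hypothetical.

```python
def transcript_by_channel(operation):
    """Join the recognized text of a finished Operation, grouped by channelTag."""
    if not operation.get("done"):
        raise ValueError("operation is not finished yet")
    # The chunks array may be missing if recognition failed.
    chunks = operation.get("response", {}).get("chunks", [])
    parts = {}
    for chunk in chunks:
        alternatives = chunk.get("alternatives") or []
        if not alternatives:
            continue
        # Assumption: the first alternative is the one to use.
        parts.setdefault(chunk["channelTag"], []).append(alternatives[0]["text"])
    return {tag: " ".join(texts) for tag, texts in parts.items()}


def parse_offset(value):
    """Convert a startTime/endTime string like '1.5s' to float seconds."""
    return float(value.rstrip("s"))
```

The `confidence` fields are ignored on purpose, since the table marks them as unsupported.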
