Asynchronous recognition API v2

To use the API v2, you will need:

For more information on getting started, see How to asynchronously recognize pre-recorded audio.

Warning

You can recognize audio files asynchronously only as a service account. Do not use any other type of Yandex Cloud account for this purpose.

The asynchronous recognition service for the API v2 is located at transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize

Sending a file for recognition

Request body parameters

The request body structure is as follows:

{
  "config": {
    "specification": {
      "languageCode": "string",
      "model": "string",
      "profanityFilter": boolean,
      "literature_text": boolean,
      "audioEncoding": "string",
      "sampleRateHertz": integer,
      "audioChannelCount": integer,
      "rawResults": boolean
    }
  },
  "audio": {
    "uri": "string"
  }
}

Parameter

Description

config

object
Field with recognition settings.

config.specification

object
Recognition settings.

config.specification.languageCode

string
Language of the audio file for speech recognition.
The default value is ru-RU, Russian.

config.specification.model

string
Language model for speech recognition.
The default value is general.
Different models have different pricing.

config.specification.profanityFilter

boolean
Profanity filter.
Acceptable values:

  • true: Mask profanities with asterisks in recognition results.
  • false (default): Do not mask profanities.

config.specification.literature_text

boolean
Enables normalization mode.

config.specification.audioEncoding

string
Audio format.
Acceptable values:

config.specification.sampleRateHertz

integer (int64)
Audio sampling rate.
This parameter is required if audioEncoding is LINEAR16_PCM. Valid values:

  • 48000 (default): 48 kHz.
  • 16000: 16 kHz.
  • 8000: 8 kHz.

config.specification.audioChannelCount

integer (int64)
Number of channels for LPCM audio files. The default value is 1.
Do not use this field for OggOpus or MP3 audio files. They already contain information about the channel count.

config.specification.rawResults

boolean
Flag that toggles spelling out numbers.
Acceptable values:

  • true: Spell out.
  • false (default): Use figures.

audio.uri

string
URI of the audio file for recognition. Supports only links to files stored in Yandex Object Storage.
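As an illustration of the request body described above, here is a minimal Python sketch that assembles the JSON and prepares the HTTP request without sending it. The https:// scheme, the POST method, the Api-Key authorization header, and the helper names build_recognition_body and make_request are assumptions made for this example and are not taken from this page; the audio URI and the key are placeholders.

```python
import json
import urllib.request

# Endpoint from the text above; the https:// scheme and POST method are assumptions.
STT_URL = "https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize"


def build_recognition_body(uri, language_code="ru-RU", model="general",
                           audio_encoding=None, sample_rate_hertz=None,
                           audio_channel_count=None, profanity_filter=False,
                           raw_results=False):
    """Assemble the request body described in the parameter table (hypothetical helper)."""
    spec = {
        "languageCode": language_code,
        "model": model,
        "profanityFilter": profanity_filter,
        "rawResults": raw_results,
    }
    # Optional fields are included only when the caller sets them.
    if audio_encoding is not None:
        spec["audioEncoding"] = audio_encoding
    if sample_rate_hertz is not None:
        spec["sampleRateHertz"] = sample_rate_hertz
    if audio_channel_count is not None:
        spec["audioChannelCount"] = audio_channel_count
    return {"config": {"specification": spec}, "audio": {"uri": uri}}


def make_request(body, api_key):
    """Prepare, but do not send, the HTTP request.

    The "Api-Key" Authorization scheme is an assumption, not taken from this page.
    """
    return urllib.request.Request(
        STT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Api-Key {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


body = build_recognition_body(
    "https://storage.yandexcloud.net/<bucket>/<file>.wav",  # placeholder URI
    audio_encoding="LINEAR16_PCM",
    sample_rate_hertz=8000,
    audio_channel_count=1,
)
request = make_request(body, "<API key>")  # placeholder key
# To actually send it: urllib.request.urlopen(request)
```

Since sampleRateHertz is required for LINEAR16_PCM, the example sets it explicitly; for OggOpus or MP3 files, audio_channel_count would be left unset.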

Response

If the request is valid, the service returns an Operation object containing the recognition operation ID (id):

{
  "done": false,
  "id": "e03sup6d5h1q********",
  "createdAt": "2019-04-21T22:49:29Z",
  "createdBy": "ajes08feato8********",
  "modifiedAt": "2019-04-21T22:49:29Z"
}

Use this ID at the next step.
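A short Python sketch of reading the operation ID from this response might look as follows. It parses the example response shown above; the variable names are illustrative only.

```python
import json

# Response body copied from the example above.
raw_response = """
{
  "done": false,
  "id": "e03sup6d5h1q********",
  "createdAt": "2019-04-21T22:49:29Z",
  "createdBy": "ajes08feato8********",
  "modifiedAt": "2019-04-21T22:49:29Z"
}
"""

operation = json.loads(raw_response)
operation_id = operation["id"]          # keep this for the status request
is_done = operation.get("done", False)  # false right after submission
print(operation_id, is_done)
```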

Getting recognition results

To check the operation status and get the recognition results, send a request to operation.api.cloud.yandex.net.

Monitor the recognition results using the obtained ID. The number of status requests is limited, so space them out: it takes about 10 seconds to recognize 1 minute of single-channel audio.

Warning

Recognition results are stored on the server for 3 days. During this period, you can request them using the obtained ID.

Path parameters

Parameter Description
operationId Operation ID received when sending the recognition request
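One way to monitor the operation is a polling loop that waits between checks, using the estimate of about 10 seconds per minute of audio. The sketch below is only an illustration: the /operations/{operationId} path, the https:// scheme, the Api-Key authorization header, and the helper names fetch_operation and wait_for_result are assumptions and are not taken from this page.

```python
import json
import time
import urllib.request

# Host from the text above; the /operations/{operationId} path and the
# https:// scheme are assumptions.
OPERATION_URL = "https://operation.api.cloud.yandex.net/operations/{operation_id}"


def fetch_operation(operation_id, api_key):
    """GET the Operation object (the Api-Key auth scheme is an assumption)."""
    req = urllib.request.Request(
        OPERATION_URL.format(operation_id=operation_id),
        headers={"Authorization": f"Api-Key {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_result(operation_id, fetch, audio_minutes=1.0, max_polls=30,
                    sleep=time.sleep):
    """Poll until done is true, pausing about 10 s per minute of audio between checks."""
    delay = max(1.0, 10.0 * audio_minutes)
    for _ in range(max_polls):
        operation = fetch(operation_id)
        if operation.get("done"):
            return operation
        sleep(delay)
    raise TimeoutError(f"operation {operation_id} not done after {max_polls} polls")
```

A call could look like wait_for_result(operation_id, lambda op_id: fetch_operation(op_id, "<API key>"), audio_minutes=5). Passing fetch and sleep as arguments keeps the loop easy to test without network access.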

Response

The Operation object is returned in response to your request. Response example:

{
  "done": true,
  "response": {
    "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
    "chunks": [
      {
        "alternatives": [
          {
            "words": [
              {
                "startTime": "0.879999999s",
                "endTime": "1.159999992s",
                "word": "when",
                "confidence": 1
              },
              {
                "startTime": "1.219999995s",
                "endTime": "1.539999988s",
                "word": "writing",
                "confidence": 1
              },
              ...
            ],
            "text": "when writing The Hobbit, Tolkien referred to the Norse mythology of the Old English poem Beowulf",
            "confidence": 1
          }
        ],
        "channelTag": "1"
      },
      ...
    ]
  },
  "id": "e03sup6d5h1q********",
  "createdAt": "2019-04-21T22:49:29Z",
  "createdBy": "ajes08feato8********",
  "modifiedAt": "2019-04-21T22:49:36Z"
}

Parameter

Description

done

boolean
Contains true when the recognition is complete.

response

object
Asynchronous speech recognition results.

response.@type

string
Response type.

response.chunks

array
Array with recognition results.
If speech recognition in the transmitted file fails, the response may not contain an array with the results.

response.chunks.alternatives

array
Array with recognized text alternatives.

response.chunks.alternatives.words

array
Array with recognized words and their details.

response.chunks.alternatives.words.startTime

string
Word start time in the recording. An error of 1-2 seconds is possible.

response.chunks.alternatives.words.endTime

string
Word end time in the recording. An error of 1-2 seconds is possible.

response.chunks.alternatives.words.word

string
Recognized word. Recognized numbers are spelled out (e.g., twelve instead of 12).

response.chunks.alternatives.words.confidence

integer (int64)
This field is not supported. Do not use it.

response.chunks.alternatives.text

string
Entire recognized text. By default, numbers are written in figures. To output the entire text in word form, set the config.specification.rawResults parameter to true.

response.chunks.alternatives.confidence

integer (int64)
This field is not supported. Do not use it.

response.chunks.channelTag

string
Audio channel for which recognition was performed.

id

string
Operation ID. Generated on the service side.

createdAt

google.protobuf.Timestamp
Operation start time. Uses RFC3339 (Timestamps) format.

createdBy

string
ID of the user who started the operation.

modifiedAt

google.protobuf.Timestamp
Resource last update time. Uses RFC3339 (Timestamps) format.

For more information about the response format and codes, see Response status codes.
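To show how the fields above fit together, here is a minimal Python sketch that collects the recognized text per audio channel and converts word timings to seconds. The helper names transcripts_by_channel and parse_seconds are hypothetical, and taking only the first alternative of each chunk is a simplifying assumption for this example.

```python
def transcripts_by_channel(operation):
    """Join the first alternative of each chunk, grouped by channelTag."""
    # The chunks array may be missing if recognition failed.
    chunks = operation.get("response", {}).get("chunks", [])
    by_channel = {}
    for chunk in chunks:
        alternatives = chunk.get("alternatives", [])
        if not alternatives:
            continue
        tag = chunk.get("channelTag", "")
        by_channel.setdefault(tag, []).append(alternatives[0]["text"])
    return {tag: " ".join(texts) for tag, texts in by_channel.items()}


def parse_seconds(value):
    """Convert a duration string such as "0.879999999s" to float seconds."""
    return float(value.rstrip("s"))


# Trimmed version of the response example above.
operation = {
    "done": True,
    "response": {
        "chunks": [
            {
                "alternatives": [
                    {
                        "words": [
                            {"startTime": "0.879999999s",
                             "endTime": "1.159999992s",
                             "word": "when"},
                        ],
                        "text": "when writing",
                    }
                ],
                "channelTag": "1",
            }
        ]
    },
}

texts = transcripts_by_channel(operation)
first_word = operation["response"]["chunks"][0]["alternatives"][0]["words"][0]
start = parse_seconds(first_word["startTime"])
```

The confidence fields are skipped on purpose, since they are not supported.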

Use cases