How to recognize long audio files in SpeechKit

SpeechKit can recognize speech in several ways. This example demonstrates asynchronous recognition of an audio file, which is available via API v3 and API v2. Asynchronous recognition is subject to the following restrictions:

  • Maximum audio duration: 4 hours
  • Maximum file size: 1 GB
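
The file size limit can be checked locally before uploading. The following is a minimal shell sketch, not part of the SpeechKit tooling: the function name fits_size_limit is hypothetical, and it assumes that 1 GB means 2^30 bytes, which is not specified here.

```shell
# Assumption: "1 GB" is read as 2^30 bytes; the service may count it differently.
MAX_BYTES=1073741824

# fits_size_limit FILE: succeeds if FILE exists and is no larger than MAX_BYTES.
# Returns 2 if FILE is not a regular file.
fits_size_limit() {
  [ -f "$1" ] || return 2
  # The arithmetic expansion strips any padding that wc adds on some systems.
  fsl_bytes=$(( $(wc -c < "$1") ))
  [ "$fsl_bytes" -le "$MAX_BYTES" ]
}

# Example (not run here):
#   fits_size_limit speech.wav && echo "size is within the limit"
```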

This example calls the API using the cURL utility. To use the API from a Python script instead, see Step-by-step guides.

Getting started

  1. Create a bucket and upload the audio file you want to recognize to it.

  2. Create a service account.

    Warning

    You can only recognize audio files asynchronously as a service account. Do not use any other Yandex Cloud accounts for this purpose.

  3. Assign the storage.uploader and ai.speechkit-stt.user roles to the service account for the folder in which you created the bucket.

  4. Get an API key or IAM token for your service account.

  5. Download a sample audio file:

    • For API v3: a WAV file.
    • For API v2: an LPCM file.

Speech recognition

API v3

  1. Get a link to an audio file in Object Storage.

  2. Create a file, e.g., request.json, and add the following code to it:

    {
      "uri": "https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>",
      "recognition_model": {
        "model": "general",
        "audio_format": {
          "container_audio": {
            "container_audio_type": "WAV"
          }
        }
      }
    }
            

    Where:

    • uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.wav.

      For buckets with restricted access, the link contains additional query parameters (after ?). SpeechKit ignores these parameters, so you do not need to provide them.

    • model: Speech recognition model.

    • container_audio_type: Audio container format.
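
The request body can also be generated from the bucket name and object path rather than edited by hand. This is a minimal sketch under stated assumptions: the function name make_v3_request is hypothetical, and the bucket name and path are assumed to contain no characters that need JSON escaping or URL encoding.

```shell
# make_v3_request BUCKET OBJECT_PATH: prints an API v3 request body on stdout.
# Assumption: both arguments are plain ASCII without quotes, spaces or backslashes.
make_v3_request() {
  cat <<EOF
{
  "uri": "https://storage.yandexcloud.net/$1/$2",
  "recognition_model": {
    "model": "general",
    "audio_format": {
      "container_audio": {
        "container_audio_type": "WAV"
      }
    }
  }
}
EOF
}

# Example (not run here):
#   make_v3_request speechkit speech.wav > request.json
```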

  3. Run the request using one of the service account authentication methods:

    • With an IAM token:

      export FOLDER_ID=<folder_ID>
      export IAM_TOKEN=<service_account_IAM_token> && \
      curl \
        --insecure \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        --data @request.json \
        https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
              

      Where:

      • FOLDER_ID: ID of the folder your service account was created in.
      • IAM_TOKEN: Service account IAM token.
    • With an API key:

      Use API keys if requesting an IAM token automatically is not an option.

      export FOLDER_ID=<folder_ID>
      export API_KEY=<service_account_API_key> && \
      curl \
        --insecure \
        --header "Authorization: Api-Key ${API_KEY}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        --data @request.json \
        https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
              

    Result example:

    {
      "id": "f8ddr61b30fk********",
      "description": "STT v3 async recognition",
      "createdAt": "2024-07-15T07:39:36Z",
      "createdBy": "ajehumcuv38h********",
      "modifiedAt": "2024-07-15T07:39:36Z",
      "done": false,
      "metadata": null
    }
            

    Save the recognition operation id you get in the response.
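
In a script, the operation ID can be pulled out of the response instead of being copied by hand. The sketch below uses only sed and head; the function name operation_id is hypothetical, and it assumes the response has the shape shown in the result example, with a single top-level "id" field. A JSON-aware tool would be more robust.

```shell
# operation_id: reads an operation JSON document on stdin and prints the
# value of its "id" field (the first match, if there are several).
operation_id() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example (not run here): pipe the curl call from step 3 into the function.
#   OPERATION_ID=$(curl ... | operation_id)
```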

  4. Wait until the recognition is complete. It takes about 10 seconds to recognize one minute of audio.
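
The figure of about 10 seconds per minute of audio gives a rough estimate of how long to wait. The following sketch only turns that rule of thumb into arithmetic; the function name is hypothetical and actual processing time may differ.

```shell
# estimate_wait_seconds AUDIO_SECONDS: prints a rough recognition time in
# seconds, assuming about 10 seconds of processing per 60 seconds of audio.
# Integer arithmetic only; (n * 10 + 59) / 60 keeps short clips from
# collapsing to zero.
estimate_wait_seconds() {
  echo $(( ($1 * 10 + 59) / 60 ))
}
```

For example, `estimate_wait_seconds 14400` (the 4-hour maximum) prints 2400, that is, roughly 40 minutes.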

  5. Request information about the operation:

    • With an IAM token:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>

    • With an API key:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>
              

    Result example:

    {
      "done": true,
      "id": "f8ddr61b30fk********",
      "description": "STT v3 async recognition",
      "createdAt": "2024-07-15T07:39:36Z",
      "createdBy": "ajehumcuv38h********",
      "modifiedAt": "2024-07-15T07:39:37Z"
    }
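
Waiting and checking the operation status can be combined into a polling loop that repeats the status request until done is true. This is a sketch rather than an official client: poll_operation and is_done are hypothetical names, and the command that fetches the operation is passed in by the caller, for instance a small wrapper around the curl call from this step.

```shell
# is_done: reads an operation JSON document on stdin; succeeds if "done" is true.
is_done() {
  grep -Eq '"done"[[:space:]]*:[[:space:]]*true'
}

# poll_operation INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until its output reports "done": true.
# Succeeds when the operation is done, fails after MAX_ATTEMPTS tries.
poll_operation() {
  po_interval=$1
  po_max=$2
  shift 2
  po_attempt=1
  while [ "$po_attempt" -le "$po_max" ]; do
    if "$@" | is_done; then
      return 0
    fi
    po_attempt=$((po_attempt + 1))
    sleep "$po_interval"
  done
  return 1
}

# Example (not run here): fetch_operation is a hypothetical wrapper around
# the curl call that requests information about the operation.
#   poll_operation 10 60 fetch_operation && echo "recognition finished"
```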
            
  6. Request the operation result:

    • With an IAM token:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"

    • With an API key:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"

    Result example:

    {
             "result": {
                "sessionUuid": {
                   "uuid": "24935f24-2c1f62dc-8dd49006-********",
                   "userRequestId": "f8d2h7m07t4i********"
                },
                "audioCursors": {
                   "receivedDataMs": "7400",
                   "resetTimeMs": "0",
                   "partialTimeMs": "7400",
                   "finalTimeMs": "7400",
                   "finalIndex": "0",
                   "eouTimeMs": "0"
                },
                "responseWallTimeMs": "189",
                "final": {
                   "alternatives": [
                      {
                         "words": [
                            {
                               "text": "я",
                               "startTimeMs": "459",
                               "endTimeMs": "520"
                            },
                            {
                               "text": "яндекс",
                               "startTimeMs": "640",
                               "endTimeMs": "1060"
                            },
                            {
                               "text": "спичкит",
                               "startTimeMs": "1120",
                               "endTimeMs": "1959"
                            },
                            {
                               "text": "я",
                               "startTimeMs": "2480",
                               "endTimeMs": "2520"
                            },
                            {
                               "text": "могу",
                               "startTimeMs": "2580",
                               "endTimeMs": "2800"
                            },
                            {
                               "text": "превратить",
                               "startTimeMs": "2860",
                               "endTimeMs": "3360"
                            },
                            {
                               "text": "любой",
                               "startTimeMs": "3439",
                               "endTimeMs": "3709"
                            },
                            {
                               "text": "текст",
                               "startTimeMs": "3800",
                               "endTimeMs": "4140"
                            },
                            {
                               "text": "в",
                               "startTimeMs": "4200",
                               "endTimeMs": "4220"
                            },
                            {
                               "text": "речь",
                               "startTimeMs": "4279",
                               "endTimeMs": "4740"
                            },
                            {
                               "text": "теперь",
                               "startTimeMs": "5140",
                               "endTimeMs": "5759"
                            },
                            {
                               "text": "и",
                               "startTimeMs": "5859",
                               "endTimeMs": "5900"
                            },
                            {
                               "text": "вы",
                               "startTimeMs": "5980",
                               "endTimeMs": "6399"
                            },
                            {
                               "text": "можете",
                               "startTimeMs": "6660",
                               "endTimeMs": "7180"
                            }
                         ],
                         "text": "я яндекс спичкит я могу превратить любой текст в речь теперь и вы можете",
                         "startTimeMs": "0",
                         "endTimeMs": "7400",
                         "confidence": 0,
                         "languages": []
                      }
                   ],
                   "channelTag": "0"
                },
                "channelTag": "0"
             }
            }
            {
             "result": {
                "sessionUuid": {
                   "uuid": "24935f24-2c1f62dc-8dd49006-********",
                   "userRequestId": "f8d2h7m07t4i********"
                },
                "audioCursors": {
                   "receivedDataMs": "7400",
                   "resetTimeMs": "0",
                   "partialTimeMs": "7400",
                   "finalTimeMs": "7400",
                   "finalIndex": "0",
                   "eouTimeMs": "0"
                },
                "responseWallTimeMs": "189",
                "finalRefinement": {
                   "finalIndex": "0",
                   "normalizedText": {
                      "alternatives": [
                         {
                            "words": [
                               {
                                  "text": "я",
                                  "startTimeMs": "459",
                                  "endTimeMs": "520"
                               },
                               {
                                  "text": "яндекс",
                                  "startTimeMs": "640",
                                  "endTimeMs": "1060"
                               },
                               {
                                  "text": "спичкит",
                                  "startTimeMs": "1120",
                                  "endTimeMs": "1959"
                               },
                               {
                                  "text": "я",
                                  "startTimeMs": "2480",
                                  "endTimeMs": "2520"
                               },
                               {
                                  "text": "могу",
                                  "startTimeMs": "2580",
                                  "endTimeMs": "2800"
                               },
                               {
                                  "text": "превратить",
                                  "startTimeMs": "2860",
                                  "endTimeMs": "3360"
                               },
                               {
                                  "text": "любой",
                                  "startTimeMs": "3439",
                                  "endTimeMs": "3709"
                               },
                               {
                                  "text": "текст",
                                  "startTimeMs": "3800",
                                  "endTimeMs": "4140"
                               },
                               {
                                  "text": "в",
                                  "startTimeMs": "4200",
                                  "endTimeMs": "4220"
                               },
                               {
                                  "text": "речь",
                                  "startTimeMs": "4279",
                                  "endTimeMs": "4740"
                               },
                               {
                                  "text": "теперь",
                                  "startTimeMs": "5140",
                                  "endTimeMs": "5759"
                               },
                               {
                                  "text": "и",
                                  "startTimeMs": "5859",
                                  "endTimeMs": "5900"
                               },
                               {
                                  "text": "вы",
                                  "startTimeMs": "5980",
                                  "endTimeMs": "6399"
                               },
                               {
                                  "text": "можете",
                                  "startTimeMs": "6660",
                                  "endTimeMs": "7180"
                               }
                            ],
                            "text": "Я яндекс спичкит я могу превратить любой текст в речь теперь и вы можете",
                            "startTimeMs": "0",
                            "endTimeMs": "7400",
                            "confidence": 0,
                            "languages": []
                         }
                      ],
                      "channelTag": "0"
                   }
                },
                "channelTag": "0"
             }
            }
            {
             "result": {
                "sessionUuid": {
                   "uuid": "24935f24-2c1f62dc-8dd49006-********",
                   "userRequestId": "f8d2h7m07t4i********"
                },
                "audioCursors": {
                   "receivedDataMs": "7400",
                   "resetTimeMs": "0",
                   "partialTimeMs": "7400",
                   "finalTimeMs": "7400",
                   "finalIndex": "0",
                   "eouTimeMs": "7400"
                },
                "responseWallTimeMs": "190",
                "eouUpdate": {
                   "timeMs": "7400"
                },
                "channelTag": "0"
             }
            }
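
The result above is a sequence of JSON objects, and only the finalRefinement one carries the normalized text. If the jq utility is installed, that text can be extracted as sketched below. The function name normalized_text is hypothetical, and the filter assumes the response has the same structure as this example.

```shell
# normalized_text: reads the getRecognition output (a stream of JSON objects)
# on stdin and prints the normalized text of the first alternative from every
# finalRefinement event. Objects without finalRefinement produce no output.
# Assumption: jq is available on PATH.
normalized_text() {
  jq -r '.result.finalRefinement.normalizedText.alternatives[0].text // empty'
}

# Example (not run here): pipe the curl call from step 6 into the function.
#   curl ... | normalized_text
```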
            
API v2

  1. Get a link to an audio file in Object Storage.

  2. Create a file named body.json and add the following code to it:

    {
      "config": {
        "specification": {
          "languageCode": "ru-RU",
          "model": "general",
          "audioEncoding": "LINEAR16_PCM",
          "sampleRateHertz": 8000,
          "audioChannelCount": 1
        }
      },
      "audio": {
        "uri": "<link_to_audio_file>"
      }
    }
            

    Where:

    • languageCode: Recognition language.

    • model: Speech recognition model.

    • audioEncoding: Audio file format.

    • sampleRateHertz: Audio file sampling rate in Hz.

    • audioChannelCount: Number of audio channels.

    • uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.pcm.

      For buckets with restricted access, the link contains additional query parameters (after ?). SpeechKit ignores these parameters, so you do not need to provide them.
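
For headerless LPCM, the audio duration follows from the file size, sampling rate, and channel count, which makes it possible to check the 4-hour limit before sending the request. The sketch below assumes 16-bit samples (2 bytes per sample per channel), as the LINEAR16_PCM encoding implies; the function name is hypothetical.

```shell
# lpcm_duration_seconds SIZE_BYTES SAMPLE_RATE_HZ CHANNELS
# Prints the duration of raw 16-bit LPCM audio in whole seconds:
#   duration = size / (sample_rate * channels * 2 bytes per sample)
lpcm_duration_seconds() {
  echo $(( $1 / ($2 * $3 * 2) ))
}

# Example (not run here): duration of speech.pcm at 8000 Hz, one channel.
#   lpcm_duration_seconds "$(( $(wc -c < speech.pcm) ))" 8000 1
```

With the values from the request above (8000 Hz, one channel), a 230400000-byte file works out to 14400 seconds, which is exactly the 4-hour maximum.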

  3. Send the request using the file you created:

    export API_KEY=<service_account_API_key> && \
    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      --data "@body.json" \
      https://transcribe.api.cloud.yandex.net/speech/stt/v2/longRunningRecognize
            

    Result example:

    {
      "done": false,
      "id": "e03sup6d5h1q********",
      "createdAt": "2019-04-21T22:49:29Z",
      "createdBy": "ajes08feato8********",
      "modifiedAt": "2019-04-21T22:49:29Z"
    }
            

    Save the recognition operation id you get in the response.

  4. Wait until the recognition is completed. It takes about 10 seconds to recognize one minute of single-channel audio.

  5. Send a request to get information about the operation:

    curl \
      --insecure \
      --header "Authorization: Api-Key ${API_KEY}" \
      https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>
            

    Result example:

    {
      "done": true,
      "response": {
        "@type": "type.googleapis.com/yandex.cloud.ai.stt.v2.LongRunningRecognitionResponse",
        "chunks": [
          {
            "alternatives": [
              {
                "words": [
                  {
                    "startTime": "0.160s",
                    "endTime": "0.500s",
                    "word": "hello",
                    "confidence": 1
                  },
                  {
                    "startTime": "0.580s",
                    "endTime": "0.800s",
                    "word": "world",
                    "confidence": 1
                  }
                ],
                "text": "Hello world",
                "confidence": 1
              }
            ],
            "channelTag": "1"
          }
        ]
      },
      "id": "e03jjenu23uc********",
      "createdAt": "2024-08-22T11:39:22Z",
      "createdBy": "aje3bg430agh********",
      "modifiedAt": "2024-08-22T11:39:23Z"
    }
            

    If speech recognition in the provided file fails, the response.chunks section may be missing from the response.
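
Because response.chunks may be missing, a script should check for it before reading the text. The following sketch uses grep and sed only; has_chunks and v2_texts are hypothetical names, and the pattern matching assumes that recognized text contains no escaped double quotes. In the API v2 response shown above, "text" appears only at the alternative level, which is what makes this simple match workable.

```shell
# has_chunks: reads an API v2 operation response on stdin; succeeds if it
# contains a "chunks" field.
has_chunks() {
  grep -q '"chunks"[[:space:]]*:'
}

# v2_texts: reads an API v2 operation response on stdin and prints the value
# of every "text" field, one per line.
# Assumption: the text values contain no escaped double quotes.
v2_texts() {
  grep -o '"text"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/^"text"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Example (not run here), with the response saved to result.json:
#   if has_chunks < result.json; then v2_texts < result.json; fi
```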
