Asynchronous WAV audio file recognition using the API v3

The example below shows how to use the SpeechKit API v3 for asynchronous speech recognition from a WAV audio file.

Authentication is performed under a service account using an API key or IAM token. For more information about authentication in the SpeechKit API, see the API reference.

Getting started

  1. Create a bucket and upload the audio file you want to recognize to it.

  2. Create a service account.

    Warning

    You can recognize audio files asynchronously only as a service account. Do not use any other Yandex Cloud accounts for this purpose.

  3. Assign the storage.uploader and ai.speechkit-stt.user roles to the service account for the folder in which you created the bucket.

  4. Get an IAM token or API key for the created service account.

If you do not have a WAV audio file, you can use this sample file.

Perform speech recognition via the API v3

  1. Get a link to an audio file in Object Storage.

  2. Create a file, e.g., request.json, and add the following code to it:

    {
      "uri": "https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>",
      "recognition_model": {
        "model": "general",
        "audio_format": {
          "container_audio": {
            "container_audio_type": "WAV"
          }
        }
      }
    }
            

    Where:

    • uri: Link to the audio file in Object Storage. Here is an example of such a link: https://storage.yandexcloud.net/speechkit/speech.wav.

      For buckets with restricted access, the link contains additional query parameters (after ?). You do not need to provide these parameters to SpeechKit, as they are ignored.

    • model: Speech recognition model.

    • container_audio_type: Audio container format.
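
The request body described above can also be generated programmatically. The following is a minimal sketch, assuming Python 3 and only the standard library; the helper name build_request_body and the sample query string are hypothetical and not part of the SpeechKit API. It drops the query parameters from the Object Storage link, since SpeechKit ignores them, and prints the resulting JSON.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def build_request_body(uri: str, model: str = "general") -> dict:
    """Build a request body for a WAV file (hypothetical helper)."""
    parts = urlsplit(uri)
    # Drop the query string and fragment: SpeechKit ignores these parameters.
    clean_uri = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return {
        "uri": clean_uri,
        "recognition_model": {
            "model": model,
            "audio_format": {
                "container_audio": {"container_audio_type": "WAV"}
            },
        },
    }


if __name__ == "__main__":
    # The query string below is an illustrative placeholder.
    body = build_request_body(
        "https://storage.yandexcloud.net/speechkit/speech.wav?X-Amz-Expires=3600"
    )
    print(json.dumps(body, indent=2))
```

Redirecting the printed output to request.json gives a file equivalent to the one created manually in this step.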

  3. Run the request using one of the service account authentication methods:

    • With an IAM token:

      export FOLDER_ID=<folder_ID>
      export IAM_TOKEN=<service_account_IAM_token>
      curl \
        --insecure \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        --data @request.json \
        https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
              

      Where:

      • FOLDER_ID: ID of the folder your service account was created in.
      • IAM_TOKEN: Service account IAM token.
    • With an API key:

      Use API keys if requesting an IAM token automatically is not an option.

      export FOLDER_ID=<folder_ID>
      export API_KEY=<service_account_API_key>
      curl \
        --insecure \
        --header "Authorization: Api-Key ${API_KEY}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        --data @request.json \
        https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync
              

    Result example:

    {
               "id":"f8ddr61b30fk********",
               "description":"STT v3 async recognition",
               "createdAt":"2024-07-15T07:39:36Z",
               "createdBy":"ajehumcuv38h********",
               "modifiedAt":"2024-07-15T07:39:36Z",
               "done":false,
               "metadata":null
            }
            

    Save the recognition operation id you get in the response.
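
The same request can be prepared from a script, and the operation id pulled out of the response. This is a sketch under stated assumptions: it uses only the Python standard library, the helper names build_recognize_request and extract_operation_id are hypothetical, and the URL and headers are copied from the curl commands above. The code only builds the request object and does not send it.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
STT_URL = "https://stt.api.cloud.yandex.net:443/stt/v3/recognizeFileAsync"


def build_recognize_request(
    body: dict, folder_id: str, *, iam_token: str = "", api_key: str = ""
) -> urllib.request.Request:
    """Prepare the POST request with one of the two authentication methods."""
    if bool(iam_token) == bool(api_key):
        raise ValueError("Provide exactly one of iam_token or api_key")
    auth = f"Bearer {iam_token}" if iam_token else f"Api-Key {api_key}"
    return urllib.request.Request(
        STT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": auth, "x-folder-id": folder_id},
        method="POST",
    )


def extract_operation_id(response_text: str) -> str:
    """Return the id field of the operation object in the response."""
    return json.loads(response_text)["id"]
```

To send the request, the returned object can be passed to urllib.request.urlopen, and the decoded response body to extract_operation_id.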

  4. Wait until the recognition is complete. It takes about 10 seconds to recognize one minute of audio.
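
As a rough illustration of the estimate above, the expected wait can be derived from the audio length. The sketch below assumes a PCM WAV file readable by the Python standard wave module; both helper names are hypothetical, and the ratio of 10 seconds per minute of audio is only the approximate figure given in this step.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def estimate_wait_seconds(audio_seconds: float, seconds_per_minute: float = 10.0) -> float:
    """Estimate recognition time: about 10 seconds per minute of audio."""
    return audio_seconds / 60.0 * seconds_per_minute
```

For example, estimate_wait_seconds(180) gives 30.0, that is, about half a minute for a three-minute recording.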

  5. Request information about the operation:

    • Using IAM token authentication:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>
              
    • Using API key authentication:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        https://operation.api.cloud.yandex.net/operations/<recognition_operation_ID>
              

    Result example:

    {
               "done": true,
               "id": "f8ddr61b30fk********",
               "description": "STT v3 async recognition",
               "createdAt": "2024-07-15T07:39:36Z",
               "createdBy": "ajehumcuv38h********",
               "modifiedAt": "2024-07-15T07:39:37Z"
            }
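
Instead of checking the operation manually, the done field can be polled in a loop. The following sketch assumes that the caller supplies a function returning the operation as a dictionary, for example the parsed JSON response of the request above; the name wait_until_done, the polling interval, and the timeout are hypothetical choices.

```python
import time
from typing import Callable


def wait_until_done(
    fetch_operation: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_operation until the operation reports done, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        operation = fetch_operation()
        if operation.get("done"):
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError("The recognition operation did not finish in time")
        # Pause between requests to avoid polling the API too often.
        sleep(interval)
```

The sleep parameter is injectable so that the loop can be exercised in tests without actually waiting.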
            
  6. Request the operation result:

    • Using IAM token authentication:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Bearer ${IAM_TOKEN}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"
              
    • Using API key authentication:

      curl \
        --insecure \
        --request GET \
        --header "Authorization: Api-Key ${API_KEY}" \
        --header "x-folder-id: ${FOLDER_ID}" \
        "https://stt.api.cloud.yandex.net:443/stt/v3/getRecognition?operation_id=<recognition_operation_ID>"
              
    Result example:

    {
             "result": {
                "sessionUuid": {
                   "uuid": "24935f24-2c1f62dc-8dd49006-********",
                   "userRequestId": "f8d2h7m07t4i********"
                },
                "audioCursors": {
                   "receivedDataMs": "7400",
                   "resetTimeMs": "0",
                   "partialTimeMs": "7400",
                   "finalTimeMs": "7400",
                   "finalIndex": "0",
                   "eouTimeMs": "0"
                },
                "responseWallTimeMs": "189",
                "final": {
                   "alternatives": [
                      {
                         "words": [
                            {
                               "text": "я",
                               "startTimeMs": "459",
                               "endTimeMs": "520"
                            },
                            {
                               "text": "яндекс",
                               "startTimeMs": "640",
                               "endTimeMs": "1060"
                            },
                            {
                               "text": "спичкит",
                               "startTimeMs": "1120",
                               "endTimeMs": "1959"
                            },
                            {
                               "text": "я",
                               "startTimeMs": "2480",
                               "endTimeMs": "2520"
                            },
                            {
                               "text": "могу",
                               "startTimeMs": "2580",
                               "endTimeMs": "2800"
                            },
                            {
                               "text": "превратить",
                               "startTimeMs": "2860",
                               "endTimeMs": "3360"
                            },
                            {
                               "text": "любой",
                               "startTimeMs": "3439",
                               "endTimeMs": "3709"
                            },
                            {
                               "text": "текст",
                               "startTimeMs": "3800",
                               "endTimeMs": "4140"
                            },
                            {
                               "text": "в",
                               "startTimeMs": "4200",
                               "endTimeMs": "4220"
                            },
                            {
                               "text": "речь",
                               "startTimeMs": "4279",
                               "endTimeMs": "4740"
                            },
                            {
                               "text": "теперь",
                               "startTimeMs": "5140",
                               "endTimeMs": "5759"
                            },
                            {
                               "text": "и",
                               "startTimeMs": "5859",
                               "endTimeMs": "5900"
                            },
                            {
                               "text": "вы",
                               "startTimeMs": "5980",
                               "endTimeMs": "6399"
                            },
                            {
                               "text": "можете",
                               "startTimeMs": "6660",
                               "endTimeMs": "7180"
                            }
                         ],
                         "text": "я яндекс спичкит я могу превратить любой текст в речь теперь и вы можете",
                         "startTimeMs": "0",
                         "endTimeMs": "7400",
                         "confidence": 0,
                         "languages": []
                      }
                   ],
                   "channelTag": "0"
                },
                "channelTag": "0"
             }
            }
            {
             "result": {
                "sessionUuid": {
                   "uuid": "24935f24-2c1f62dc-8dd49006-********",
                   "userRequestId": "f8d2h7m07t4i********"
                },
                "audioCursors": {
                   "receivedDataMs": "7400",
                   "resetTimeMs": "0",
                   "partialTimeMs": "7400",
                   "finalTimeMs": "7400",
                   "finalIndex": "0",
                   "eouTimeMs": "0"
                },
                "responseWallTimeMs": "189",
                "finalRefinement": {
                   "finalIndex": "0",
                   "normalizedText": {
                      "alternatives": [
                         {
                            "words": [
                               {
                                  "text": "я",
                                  "startTimeMs": "459",
                                  "endTimeMs": "520"
                               },
                               {
                                  "text": "яндекс",
                                  "startTimeMs": "640",
                                  "endTimeMs": "1060"
                               },
                               {
                                  "text": "спичкит",
                                  "startTimeMs": "1120",
                                  "endTimeMs": "1959"
                               },
                               {
                                  "text": "я",
                                  "startTimeMs": "2480",
                                  "endTimeMs": "2520"
                               },
                               {
                                  "text": "могу",
                                  "startTimeMs": "2580",
                                  "endTimeMs": "2800"
                               },
                               {
                                  "text": "превратить",
                                  "startTimeMs": "2860",
                                  "endTimeMs": "3360"
                               },
                               {
                                  "text": "любой",
                                  "startTimeMs": "3439",
                                  "endTimeMs": "3709"
                               },
                               {
                                  "text": "текст",
                                  "startTimeMs": "3800",
                                  "endTimeMs": "4140"
                               },
                               {
                                  "text": "в",
                                  "startTimeMs": "4200",
                                  "endTimeMs": "4220"
                               },
                               {
                                  "text": "речь",
                                  "startTimeMs": "4279",
                                  "endTimeMs": "4740"
                               },
                               {
                                  "text": "теперь",
                                  "startTimeMs": "5140",
                                  "endTimeMs": "5759"
                               },
                               {
                                  "text": "и",
                                  "startTimeMs": "5859",
                                  "endTimeMs": "5900"
                               },
                               {
                                  "text": "вы",
                                  "startTimeMs": "5980",
                                  "endTimeMs": "6399"
                               },
                               {
                                  "text": "можете",
                                  "startTimeMs": "6660",
                                  "endTimeMs": "7180"
                               }
                            ],
                            "text": "Я яндекс спичкит я могу превратить любой текст в речь теперь и вы можете",
                            "startTimeMs": "0",
                            "endTimeMs": "7400",
                            "confidence": 0,
                            "languages": []
                         }
                      ],
                      "channelTag": "0"
                   }
                },
                "channelTag": "0"
             }
            }
            {
             "result": {
                "sessionUuid": {
                   "uuid": "24935f24-2c1f62dc-8dd49006-********",
                   "userRequestId": "f8d2h7m07t4i********"
                },
                "audioCursors": {
                   "receivedDataMs": "7400",
                   "resetTimeMs": "0",
                   "partialTimeMs": "7400",
                   "finalTimeMs": "7400",
                   "finalIndex": "0",
                   "eouTimeMs": "7400"
                },
                "responseWallTimeMs": "190",
                "eouUpdate": {
                   "timeMs": "7400"
                },
                "channelTag": "0"
             }
            }
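
The response above is a sequence of JSON objects written one after another rather than a single JSON document, so json.loads cannot parse it in one call. The sketch below shows one possible way to read it, assuming the response body is available as a string; the helper names are hypothetical, and the field names are taken from the result example above.

```python
import json
from typing import Iterator, List


def iter_json_objects(text: str) -> Iterator[dict]:
    """Yield each JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects; raw_decode does not accept it.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def normalized_texts(text: str) -> List[str]:
    """Collect the first normalized alternative of every finalRefinement message."""
    texts = []
    for message in iter_json_objects(text):
        refinement = message.get("result", {}).get("finalRefinement")
        if refinement:
            alternatives = refinement["normalizedText"]["alternatives"]
            if alternatives:
                texts.append(alternatives[0]["text"])
    return texts
```

Messages that carry only final or eouUpdate are skipped here, so the function returns the normalized text once per recognized fragment.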
            
To perform the same recognition with the Python gRPC client instead of cURL:

  1. Clone the Yandex Cloud API repository:

    git clone https://github.com/yandex-cloud/cloudapi
            
  2. Use the pip package manager to install the grpcio-tools package:

    pip install grpcio-tools
            
  3. Go to the directory hosting the cloned Yandex Cloud API repository, create a directory named output, and generate the client interface code there:

    cd <path_to_cloudapi_directory>
    mkdir output
    python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
      --python_out=output \
      --grpc_python_out=output \
      google/api/http.proto \
      google/api/annotations.proto \
      yandex/cloud/api/operation.proto \
      google/rpc/status.proto \
      yandex/cloud/operation/operation.proto \
      yandex/cloud/validation.proto \
      yandex/cloud/ai/stt/v3/stt_service.proto \
      yandex/cloud/ai/stt/v3/stt.proto
            

    The stt_pb2.py, stt_pb2_grpc.py, stt_service_pb2.py, and stt_service_pb2_grpc.py client interface files, as well as dependency files, will be created in the output folder.

  4. Create a file, e.g., test.py, in the output folder root and add the following API request code to it:

    import grpc
    from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc

    request = stt_pb2.RecognizeFileRequest(
        uri='https://storage.yandexcloud.net/<bucket_name>/<path_to_WAV_file_in_bucket>',
        recognition_model=stt_pb2.RecognitionModelOptions(
            model='general',
            audio_format=stt_pb2.AudioFormatOptions(
                container_audio=stt_pb2.ContainerAudio(
                    container_audio_type=stt_pb2.ContainerAudio.WAV
                )
            )
        )
    )

    cred = grpc.ssl_channel_credentials()
    chan = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)
    stub = stt_service_pb2_grpc.AsyncRecognizerStub(chan)

    # Choose one of the authentication methods:

    # Authentication with an IAM token
    response = stub.RecognizeFile(request, metadata=[('authorization', 'Bearer <IAM_token>')])

    # Authentication with an API key
    # response = stub.RecognizeFile(request, metadata=[('authorization', 'Api-Key <API_key>')])

    print(response)
            
  5. Run the created file:

    python3 test.py
            

    Result:

    id: "f8dem628l2mq********"
    description: "STT v3 async recognition"
    created_at {
      seconds: 1721032219
    }
    created_by: "ajehumcuv38h********"
    modified_at {
      seconds: 1721032219
    }
            

    Save the recognition operation id you get in the response.

  6. Create a file, e.g., results.py, in the output folder root and add the following code to it to get the operation result:

    import grpc
    from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc, stt_service_pb2

    request = stt_service_pb2.GetRecognitionRequest(
        operation_id="<operation_ID>"
    )

    cred = grpc.ssl_channel_credentials()
    chan = grpc.secure_channel('stt.api.cloud.yandex.net:443', cred)
    stub = stt_service_pb2_grpc.AsyncRecognizerStub(chan)

    # Authentication with an IAM token
    response = stub.GetRecognition(request, metadata=[('authorization', 'Bearer <IAM_token>')])

    # Authentication with an API key
    # response = stub.GetRecognition(request, metadata=[('authorization', 'Api-Key <API_key>')])

    print(list(response))
            
  7. Run the created file:

    python3 results.py
            
    Result example:

    [session_uuid {
              uuid: "df49eaa2-25a55218-ae967fa1-********"
              user_request_id: "f8dkup42nmhk********"
            }
            audio_cursors {
              received_data_ms: 6600
              partial_time_ms: 6600
              final_time_ms: 6600
            }
            response_wall_time_ms: 204
            final {
              alternatives {
                words {
                  text: "I"
                  start_time_ms: 380
                  end_time_ms: 420
                }
                words {
                  text: "Yandex"
                  start_time_ms: 539
                  end_time_ms: 919
                }
                words {
                  text: "SpeechKit"
                  start_time_ms: 960
                  end_time_ms: 1719
                }
                words {
                  text: "I"
                  start_time_ms: 2159
                  end_time_ms: 2200
                }
                words {
                  text: "can"
                  start_time_ms: 2260
                  end_time_ms: 2440
                }
                words {
                  text: "turn"
                  start_time_ms: 2520
                  end_time_ms: 3000
                }
                words {
                  text: "any"
                  start_time_ms: 3060
                  end_time_ms: 3320
                }
                words {
                  text: "text"
                  start_time_ms: 3419
                  end_time_ms: 3740
                }
                words {
                  text: "into"
                  start_time_ms: 3780
                  end_time_ms: 3800
                }
                words {
                  text: "speech"
                  start_time_ms: 3860
                  end_time_ms: 4279
                }
                words {
                  text: "now"
                  start_time_ms: 4680
                  end_time_ms: 5240
                }
                words {
                  text: "you"
                  start_time_ms: 5339
                  end_time_ms: 5380
                }
                words {
                  text: "can"
                  start_time_ms: 5460
                  end_time_ms: 5766
                }
                words {
                  text: "too"
                  start_time_ms: 5920
                  end_time_ms: 6393
                }
                text: "I'm Yandex SpeechKit I can turn any text into speech now you can too"
                end_time_ms: 6600
              }
              channel_tag: "0"
            }
            channel_tag: "0"
            , session_uuid {
              uuid: "df49eaa2-25a55218-ae967fa1-********"
              user_request_id: "f8dkup42nmhk********"
            }
            audio_cursors {
              received_data_ms: 6600
              partial_time_ms: 6600
              final_time_ms: 6600
            }
            response_wall_time_ms: 204
            final_refinement {
              normalized_text {
                alternatives {
                  words {
                    text: "I"
                    start_time_ms: 380
                    end_time_ms: 420
                  }
                  words {
                    text: "Yandex"
                    start_time_ms: 539
                    end_time_ms: 919
                  }
                  words {
                    text: "SpeechKit"
                    start_time_ms: 960
                    end_time_ms: 1719
                  }
                  words {
                    text: "I"
                    start_time_ms: 2159
                    end_time_ms: 2200
                  }
                  words {
                    text: "can"
                    start_time_ms: 2260
                    end_time_ms: 2440
                  }
                  words {
                    text: "turn"
                    start_time_ms: 2520
                    end_time_ms: 3000
                  }
                  words {
                    text: "any"
                    start_time_ms: 3060
                    end_time_ms: 3320
                  }
                  words {
                    text: "text"
                    start_time_ms: 3419
                    end_time_ms: 3740
                  }
                  words {
                    text: "into"
                    start_time_ms: 3780
                    end_time_ms: 3800
                  }
                  words {
                    text: "speech"
                    start_time_ms: 3860
                    end_time_ms: 4279
                  }
                  words {
                    text: "now"
                    start_time_ms: 4680
                    end_time_ms: 5240
                  }
                  words {
                    text: "you"
                    start_time_ms: 5339
                    end_time_ms: 5380
                  }
                  words {
                    text: "can"
                    start_time_ms: 5460
                    end_time_ms: 5766
                  }
                  words {
                    text: "too"
                    start_time_ms: 5920
                    end_time_ms: 6393
                  }
                  text: "I'm Yandex SpeechKit I can turn any text into speech now you can too"
                  end_time_ms: 6600
                }
                channel_tag: "0"
              }
            }
            channel_tag: "0"
            , session_uuid {
              uuid: "df49eaa2-25a55218-ae967fa1-********"
              user_request_id: "f8dkup42nmhk********"
            }
            audio_cursors {
              received_data_ms: 6600
              partial_time_ms: 6600
              final_time_ms: 6600
              eou_time_ms: 6600
            }
            response_wall_time_ms: 204
            eou_update {
              time_ms: 6600
            }
            channel_tag: "0"
            ]
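
The printed list above is verbose when only the recognized text is needed. The following sketch shows one way to reduce the streamed responses to the normalized text. It assumes the response objects are protobuf messages with the fields visible in the output above (final_refinement, normalized_text, alternatives, text); the helper name collect_normalized_text is hypothetical, and HasField is the standard protobuf message method for checking whether a message field is set.

```python
from typing import Iterable, List


def collect_normalized_text(responses: Iterable) -> List[str]:
    """Return the first normalized alternative of each final_refinement message."""
    texts = []
    for response in responses:
        # Only final_refinement messages carry normalized text;
        # final and eou_update messages are skipped.
        if response.HasField("final_refinement"):
            alternatives = response.final_refinement.normalized_text.alternatives
            if alternatives:
                texts.append(alternatives[0].text)
    return texts
```

In results.py, the print(list(response)) call could then be replaced with print(" ".join(collect_normalized_text(response))).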