Speech synthesis in the API v3

With the SpeechKit API v3, you can synthesize speech from text in TTS markup to a WAV file.

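The example uses the following synthesis parameters:

  • Output audio format: WAV container.
  • Loudness normalization: LUFS.
  • Voice: alexander.
  • Role: good.
  • Speed: 1.1.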

To convert and record the result, you will need the grpcio-tools and pydub packages and the FFmpeg utility.
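Before starting, you can check these prerequisites from Python. The sketch below is not part of the official example: the missing_prerequisites function name is hypothetical, and it assumes that grpcio-tools is importable as grpc_tools and that the FFmpeg executable is named ffmpeg.

```python
import importlib.util
import shutil


def missing_prerequisites(modules=("grpc_tools", "pydub"), tools=("ffmpeg",)):
    """Return the names of Python modules and command-line tools that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing += [name for name in tools if shutil.which(name) is None]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("Missing:", ", ".join(problems) if problems else "nothing")
```

An empty list means that both packages and FFmpeg were found in the current environment.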

Authentication is performed under a service account using an API key or IAM token. To learn more about SpeechKit API authentication, see Authentication with the SpeechKit API.
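In the client code, the two authentication methods differ only in the value of the authorization metadata entry sent with the gRPC call: Bearer for an IAM token and Api-Key for an API key. The helper below is a minimal sketch of that choice; auth_metadata and use_api_key are hypothetical names, not part of the SpeechKit API.

```python
def auth_metadata(credential, use_api_key=False):
    """Build gRPC call metadata carrying the authorization header.

    credential is an IAM token by default, or an API key when use_api_key is True.
    """
    scheme = "Api-Key" if use_api_key else "Bearer"
    return (("authorization", f"{scheme} {credential}"),)
```

The returned tuple could then be passed as the metadata argument of the synthesis call, for example stub.UtteranceSynthesis(request, metadata=auth_metadata(token)).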

To implement the example:

  1. Create a service account to work with the SpeechKit API.

  2. Assign the service account the ai.speechkit-tts.user role or higher for the folder where it was created.

  3. Get an API key or IAM token for your service account.

  4. Create a client application:

    Python

    1. Clone the Yandex Cloud API repository:

      git clone https://github.com/yandex-cloud/cloudapi
              
    2. Install the grpcio-tools and pydub packages using the pip package manager:

      pip install grpcio-tools && \
      pip install pydub
              

      You need the grpcio-tools package to generate the interface code for the API v3 synthesis client. You need the pydub package to process the resulting audio files.

    3. Download FFmpeg, which the pydub package needs to process audio. Add the path to the directory containing the FFmpeg executable to the PATH variable by running this command:

      export PATH=$PATH:<path_to_directory_with_FFmpeg_executable>
              
    4. Go to the directory hosting the cloned Yandex Cloud API repository, create a directory named output, and generate the client interface code there:

      cd <path_to_cloudapi_directory>
      mkdir output
      python3 -m grpc_tools.protoc -I . -I third_party/googleapis \
        --python_out=output \
        --grpc_python_out=output \
        google/api/http.proto \
        google/api/annotations.proto \
        yandex/cloud/api/operation.proto \
        google/rpc/status.proto \
        yandex/cloud/operation/operation.proto \
        yandex/cloud/validation.proto \
        yandex/cloud/ai/tts/v3/tts_service.proto \
        yandex/cloud/ai/tts/v3/tts.proto
              

      This will create the tts_pb2.py, tts_pb2_grpc.py, tts_service_pb2.py, and tts_service_pb2_grpc.py client interface files, as well as dependency files, in the output directory.

    5. Create a file (e.g., test.py) in the root of the output directory, and add the following code to it:

      import io
      import grpc
      import pydub
      import argparse

      import yandex.cloud.ai.tts.v3.tts_pb2 as tts_pb2
      import yandex.cloud.ai.tts.v3.tts_service_pb2_grpc as tts_service_pb2_grpc

      # Specify the synthesis settings.
      # Provide api_key instead of iam_token when authenticating with an API key:
      # def synthesize(api_key, text) -> pydub.AudioSegment:
      def synthesize(iam_token, text) -> pydub.AudioSegment:
          request = tts_pb2.UtteranceSynthesisRequest(
              text=text,
              output_audio_spec=tts_pb2.AudioFormatOptions(
                  container_audio=tts_pb2.ContainerAudio(
                      container_audio_type=tts_pb2.ContainerAudio.WAV
                  )
              ),
              # Synthesis parameters
              hints=[
                  tts_pb2.Hints(voice='alexander'),  # (Optional) Specify the voice. The default value is `marina`
                  tts_pb2.Hints(role='good'),  # (Optional) Specify the role only if applicable for this voice
                  tts_pb2.Hints(speed=1.1),  # (Optional) Specify synthesis speed
              ],
              loudness_normalization_type=tts_pb2.UtteranceSynthesisRequest.LUFS
          )

          # Establish a connection with the server.
          cred = grpc.ssl_channel_credentials()
          channel = grpc.secure_channel('tts.api.cloud.yandex.net:443', cred)
          stub = tts_service_pb2_grpc.SynthesizerStub(channel)

          # Send data for synthesis.
          it = stub.UtteranceSynthesis(request, metadata=(
              # Parameters for authentication with an IAM token
              ('authorization', f'Bearer {iam_token}'),
              # Parameters for authentication with an API key as a service account
              # ('authorization', f'Api-Key {api_key}'),
          ))

          # Create an audio file out of chunks.
          try:
              audio = io.BytesIO()
              for response in it:
                  audio.write(response.audio_chunk.data)
              audio.seek(0)
              return pydub.AudioSegment.from_wav(audio)
          except grpc.RpcError as err:
              # Use the public gRPC error interface instead of private attributes.
              print(f'Error code {err.code()}, message: {err.details()}')
              raise


      if __name__ == '__main__':
          parser = argparse.ArgumentParser()
          parser.add_argument('--token', required=True, help='IAM token or API key')
          parser.add_argument('--text', required=True, help='Text for synthesis')
          parser.add_argument('--output', required=True, help='Output file')
          args = parser.parse_args()

          audio = synthesize(args.token, args.text)
          with open(args.output, 'wb') as fp:
              audio.export(fp, format='wav')
              
    6. Execute the file from the previous step:

      export IAM_TOKEN=<service_account_IAM_token>
      export TEXT="I'm Yandex Speech+Kit. I can turn any text into speech. Now y+ou can, too!"
      python3 output/test.py \
        --token "${IAM_TOKEN}" \
        --output speech.wav \
        --text "${TEXT}"
              

      Where:

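      • IAM_TOKEN: Service account IAM token. If you use an API key for authentication under a service account, switch the Python script to the commented-out API key variants of the function signature and the authorization metadata, and provide the API key in --token.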
      • TEXT: Text for synthesis in TTS markup.
      • --output: Name of the file for the audio.

      As a result, a file named speech.wav with synthesized speech will be created in the cloudapi directory.
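To confirm that the output is a readable WAV file, you can inspect its header with the wave module from the Python standard library. This is a minimal sketch, assuming the file holds uncompressed PCM audio; describe_wav is a hypothetical helper name.

```python
import wave


def describe_wav(path):
    """Return the channel count, sample rate (Hz), and duration (s) of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            # Duration is the frame count divided by the frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, print(describe_wav("speech.wav")) shows the properties of the synthesized file. A wave.Error at this point would suggest that the file is not in the expected format.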

    Java

    1. Install the dependencies:

      sudo apt update && sudo apt install --yes default-jdk maven
              
    2. Clone the repository with a Java application configuration:

      git clone https://github.com/yandex-cloud-examples/yc-speechkit-tts-java
              
    3. Go to the repository directory:

      cd yc-speechkit-tts-java
              
    4. Build the project in this directory:

      mvn clean install
              
    5. Go to the target directory created by the build:

      cd target
              
    6. Specify the service account's API key and text to synthesize:

      export API_KEY=<API_key> && \
      export TEXT="I'm Yandex Speech+Kit. I can turn any text into speech. Now y+ou can, too!"
              
    7. Run the Java program for speech synthesis:

      java -cp speechkit_examples-1.0-SNAPSHOT.jar yandex.cloud.speechkit.examples.TtsV3Client "${TEXT}"
              

      As a result, the result.wav audio file should appear in the target directory. It contains speech synthesized from the text in the TEXT environment variable.

See also