Dedicated instances

Note

This feature is currently in Preview.

With AI Studio, you can deploy some models on a dedicated instance. Unlike a manual deployment on Yandex Compute Cloud VMs, you do not need to configure the environment or select optimal VM parameters. AI Studio provides stable, reliable, and efficient model inference and monitors its operation automatically.

Dedicated instances have a number of advantages:

  • Guaranteed performance parameters that are not affected by other users' traffic.
  • No additional quotas on requests or parallel generations: the only limits are those of the instance configuration you select.
  • Optimized model inference for efficient hardware utilization.

Dedicated instances are a good fit when you need to process large volumes of requests without delays. A dedicated instance is not priced by the number of input and output tokens: you pay only for its running time.

Dedicated instance models

All deployed models are accessible via the OpenAI-compatible API, ML SDK, and AI Playground. To deploy a dedicated instance, you need the ai.models.editor role or higher for the folder. To access the model, the ai.languageModels.user role is enough.
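As a minimal sketch of calling a deployed instance through the OpenAI-compatible API: the helper below assembles a standard chat-completion request body. The endpoint URL, model URI format, and key placeholder in the comments are illustrative assumptions, not documented values; substitute the values shown for your instance in the AI Studio console.

```python
def build_chat_request(model_uri: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model_uri,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical model URI; copy the real one from your instance settings.
request = build_chat_request("gpt://<folder_id>/<deployed_model>", "Hello!")

# With the openai package installed, the call itself would look like:
# from openai import OpenAI
# client = OpenAI(
#     base_url="https://<instance_endpoint>/v1",  # your instance's endpoint
#     api_key="<IAM_or_API_key>",
# )
# response = client.chat.completions.create(**request)
# print(response.choices[0].message.content)
```

Any OpenAI-compatible client can send this request; only the base URL, credentials, and model URI differ per instance.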

| Model | Context | License |
| --- | --- | --- |
| Qwen 2.5 VL 32B Instruct (Model card) | 32,768 | Apache 2.0 license |
| Qwen 2.5 7B Instruct (Model card) | 32,768 | Apache 2.0 license |
| Gemma 3 4B it (Model card) | 131,072 | Gemma Terms of Use |
| Gemma 3 12B it (Model card) | 65,536 | Gemma Terms of Use |
| T-pro-it-2.0-FP8 (Model card) | 32,768 | Apache 2.0 license |

Dedicated instance configurations

Each model may be available for deployment in several configurations: S, M, or L. Each configuration guarantees specific values of TTFT (time to first token), latency (time it takes to generate a complete response), and TPS (tokens per second) for requests of different context lengths.
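These three metrics combine into a rough end-to-end estimate: total response time is approximately the time to first token plus the remaining tokens divided by the generation rate. The numeric values below are made-up examples, not real configuration figures.

```python
def estimate_latency(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Approximate total response time: time to first token plus
    generation time for the output tokens at the guaranteed TPS."""
    return ttft_s + output_tokens / tps

# Example: TTFT = 0.5 s, TPS = 40 tokens/s, 200-token answer:
total = estimate_latency(0.5, 40.0, 200)  # 0.5 + 200/40 = 5.5 s
```

This estimate ignores network overhead and queuing, so treat it as a lower bound when sizing a configuration.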

The figure below shows how latency and the number of tokens processed by the model depend on the number of parallel generations (Concurrency in the figure): up to a certain point, the more generations the model processes in parallel, the longer each generation takes and the more tokens are generated per second overall.

[Figure: latency and aggregate token throughput vs. number of parallel generations]
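The trade-off above can be sketched with a toy model in which aggregate throughput scales linearly with concurrency until the instance saturates, after which extra requests queue and per-request latency grows. The saturation point and scaling factors here are invented for illustration only and do not describe any real configuration.

```python
def simulate(concurrency: int, base_latency_s: float = 2.0,
             per_request_tps: float = 40.0, saturation: int = 8):
    """Return (per-request latency, aggregate tokens/sec) for a toy model
    where throughput scales linearly up to `saturation` parallel requests."""
    effective = min(concurrency, saturation)
    aggregate_tps = per_request_tps * effective
    # Beyond saturation, extra requests queue, so latency grows proportionally.
    latency = base_latency_s * max(1.0, concurrency / saturation)
    return latency, aggregate_tps

for c in (1, 4, 8, 16):
    lat, tps = simulate(c)
    print(f"concurrency={c:2d}  latency={lat:.1f}s  aggregate_tps={tps:.0f}")
```

The useful operating range is below the saturation point, where throughput still grows without a latency penalty; this is what the S/M/L configurations trade off.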

Use cases