Dedicated instances

Note

This feature is currently in Preview.

With AI Studio, you can deploy some models on a dedicated instance. Unlike a manual deployment on Yandex Compute Cloud VMs, you do not need to configure the environment or select optimal VM parameters. AI Studio provides stable, reliable, and efficient model inference and monitors its operation automatically.

Dedicated instances have a number of advantages:

  • Guaranteed performance parameters that are not affected by other users' traffic.
  • No additional quotas on requests or parallel generations: the only limits are those of the instance configuration you select.
  • Optimized model inference for efficient hardware utilization.

Dedicated instances are a good fit when you need to process large volumes of requests without delays. A dedicated instance is not priced by the number of input and output tokens: you pay only for its running time.

Dedicated instance models

All deployed models are accessible via the OpenAI-compatible API, ML SDK, and AI Playground. To deploy a dedicated instance, you need the ai.models.editor role or higher for the folder. To access the model, the ai.languageModels.user role is enough.
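As a minimal sketch of calling a deployed instance through the OpenAI-compatible API: the helper below assembles a standard chat-completion request body. The endpoint URL, model URI format, and key placeholder in the comments are illustrative assumptions, not documented values; substitute the values shown for your instance in the AI Studio console.

```python
def build_chat_request(model_uri: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model_uri,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical model URI; copy the real one from your instance settings.
request = build_chat_request("gpt://<folder_id>/<deployed_model>", "Hello!")

# With the openai package installed, the call itself would look like:
# from openai import OpenAI
# client = OpenAI(
#     base_url="https://<instance_endpoint>/v1",  # your instance's endpoint
#     api_key="<IAM_or_API_key>",
# )
# response = client.chat.completions.create(**request)
# print(response.choices[0].message.content)
```

Any OpenAI-compatible client can send this request; only the base URL, credentials, and model URI differ per instance.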

| Model | Context | License |
| --- | --- | --- |
| Qwen 2.5 VL 32B Instruct (Model card) | 32,768 | Apache 2.0 license |
| Qwen 2.5 7B Instruct (Model card) | 32,768 | Apache 2.0 license |
| Gemma 3 4B it (Model card) | 131,072 | Gemma Terms of Use |
| Gemma 3 12B it (Model card) | 65,536 | Gemma Terms of Use |
| T-pro-it-2.0-FP8 (Model card) | 32,768 | Apache 2.0 license |

Dedicated instance configurations

Each model may be available for deployment in several configurations: S, M, or L. Each configuration guarantees specific values of TTFT (time to first token), latency (time it takes to generate a complete response), and TPS (tokens per second) for requests of different context lengths.
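These three metrics combine into a rough end-to-end estimate: total response time is approximately the time to first token plus the remaining tokens divided by the generation rate. The numeric values below are made-up examples, not real configuration figures.

```python
def estimate_latency(ttft_s: float, tps: float, output_tokens: int) -> float:
    """Approximate total response time: time to first token plus
    generation time for the output tokens at the guaranteed TPS."""
    return ttft_s + output_tokens / tps

# Example: TTFT = 0.5 s, TPS = 40 tokens/s, 200-token answer:
total = estimate_latency(0.5, 40.0, 200)  # 0.5 + 200/40 = 5.5 s
```

This estimate ignores network overhead and queuing, so treat it as a lower bound when sizing a configuration.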

The figure below shows how latency and the number of tokens processed by the model depend on the number of parallel generations (Concurrency in the figure): up to a certain point, the more generations the model processes in parallel, the longer each generation takes and the more tokens are generated per second overall.

[Figure: latency and aggregate token throughput vs. number of parallel generations]
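The trade-off above can be sketched with a toy model in which aggregate throughput scales linearly with concurrency until the instance saturates, after which extra requests queue and per-request latency grows. The saturation point and scaling factors here are invented for illustration only and do not describe any real configuration.

```python
def simulate(concurrency: int, base_latency_s: float = 2.0,
             per_request_tps: float = 40.0, saturation: int = 8):
    """Return (per-request latency, aggregate tokens/sec) for a toy model
    where throughput scales linearly up to `saturation` parallel requests."""
    effective = min(concurrency, saturation)
    aggregate_tps = per_request_tps * effective
    # Beyond saturation, extra requests queue, so latency grows proportionally.
    latency = base_latency_s * max(1.0, concurrency / saturation)
    return latency, aggregate_tps

for c in (1, 4, 8, 16):
    lat, tps = simulate(c)
    print(f"concurrency={c:2d}  latency={lat:.1f}s  aggregate_tps={tps:.0f}")
```

The useful operating range is below the saturation point, where throughput still grows without a latency penalty; this is what the S/M/L configurations trade off.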

Use cases