Connectors for creating Vector Store search indexes
You can create Vector Store search indexes based on files from various sources. For this purpose, Yandex Cloud ML SDK provides a command-line utility, Vector Store CLI, which supports the following data sources:
- Local file system.
- Atlassian Confluence.
- Yandex Object Storage and other S3-compatible storages.
- MediaWiki-based systems.
Installation
Install the base package, plus any optional extras required by the data sources you plan to work with:
# Basic installation with support for local files and Atlassian Confluence
pip install yandex-ai-studio-sdk
# With support for Amazon S3 and Object Storage
pip install "yandex-ai-studio-sdk[cli-s3]"
# With support for MediaWiki
pip install "yandex-ai-studio-sdk[cli-wiki]"
# All optional features
pip install "yandex-ai-studio-sdk[cli-wiki,cli-s3]"
Authentication
The Vector Store CLI supports the following authentication methods:
- The --auth flag and an IAM token or an API key. This flag is not required if the user has authenticated via another method.
  Note
  If both token types are specified, YC_API_KEY takes precedence over YC_IAM_TOKEN.
- If you have the Yandex Cloud CLI installed and configured, the tool automatically uses its configuration.
- When running on a Compute Cloud VM, the tool can use the VM metadata for automatic authentication.
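The precedence rule from the note can be mirrored in a wrapper script that decides which value to pass to --auth. This is a minimal sketch, assuming the credentials are supplied through the YC_API_KEY and YC_IAM_TOKEN environment variables; the resolve_auth helper is hypothetical and not part of the SDK.

```python
import os
from typing import Mapping, Optional


def resolve_auth(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the credential to pass to --auth, or None if neither variable is set.

    Mirrors the documented precedence: YC_API_KEY wins over YC_IAM_TOKEN.
    This helper is illustrative only; it is not the tool's own implementation.
    """
    # Empty strings are treated the same as unset variables.
    return env.get("YC_API_KEY") or env.get("YC_IAM_TOKEN") or None
```

If the function returns None, the wrapper can simply omit --auth and let the tool fall back to the Yandex Cloud CLI configuration or the VM metadata.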
Usage
You can work with the Vector Store CLI using commands formatted as follows:
yandex-ai-studio vector-stores <data_source_type> [<parameters>] [<data_path>]
Where:
- <data_source_type>: Data source name. The possible values are:
  - local: Local file system.
  - confluence: Atlassian Confluence.
  - s3: Object Storage and other S3-compatible storages.
  - wiki: MediaWiki-based systems.
- <parameters>: Authentication data and additional indexing parameters.
- <data_path>: Path to the data for indexing.
The way you work with the utility varies depending on your data source:
The local data source does not accept folder paths. Specify full file paths or use shell globbing; e.g., *.txt matches all files with the .txt extension.
Parameters
| Parameter | Description |
|---|---|
| --max-file-size INT | Skip files larger than the specified size in bytes |
Use cases
# Indexing a single file
yandex-ai-studio vector-stores local report.pdf
# Indexing multiple files
yandex-ai-studio vector-stores local docs/intro.txt docs/guide.md
# Using shell globbing to include all `.txt` and `.md` files
yandex-ai-studio vector-stores local sample_docs/*.txt sample_docs/*.md
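Because the local data source takes file paths rather than folders, a script can expand a folder into an explicit file list before calling the CLI. The following is a minimal sketch, assuming the utility is available on PATH as yandex-ai-studio; the collect_files and build_local_command helpers are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def collect_files(folder: str, patterns: Iterable[str]) -> List[str]:
    """Expand glob patterns inside a folder into a sorted list of file paths."""
    root = Path(folder)
    found = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            # Directories are skipped: only file paths are collected.
            if path.is_file():
                found.add(str(path))
    return sorted(found)


def build_local_command(files: List[str], max_file_size: Optional[int] = None) -> List[str]:
    """Build the argument list for the local data source (parameters first, then paths)."""
    if not files:
        raise ValueError("at least one file path is required")
    command = ["yandex-ai-studio", "vector-stores", "local"]
    if max_file_size is not None:
        command += ["--max-file-size", str(max_file_size)]
    return command + files
```

The resulting list can be passed to subprocess.run(command, check=True). Building a list instead of a shell string avoids quoting problems with file names that contain spaces.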
The confluence data source creates a search index based on the URLs of Atlassian Confluence pages. The URL must contain the page ID. For example:
- Cloud storage: https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title
- Local storage: https://confluence.example.com/pages/viewpage.action?pageId=123456
Warning
URLs in /display/SPACE/Page+Title format are not supported.
To find out the page ID:
- Cloud storage. Take the number that follows /pages/ in the page URL.
- Local storage. With the page open, open the page actions menu and select Page information. The ID is the value of the pageId parameter in the resulting URL.
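The two supported URL shapes can be checked before any pages are submitted. This is a minimal sketch, assuming page IDs are numeric as in the examples above; extract_page_id is a hypothetical helper, not part of the SDK.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_page_id(page_url: str) -> Optional[str]:
    """Return the page ID from a supported Confluence URL, or None if there is none."""
    parsed = urlparse(page_url)
    # Local format: .../pages/viewpage.action?pageId=123456
    ids = parse_qs(parsed.query).get("pageId")
    if ids and ids[0].isdigit():
        return ids[0]
    # Cloud format: .../wiki/spaces/SPACE/pages/123456/Page+Title
    match = re.search(r"/pages/(\d+)(?:/|$)", parsed.path)
    return match.group(1) if match else None
```

A URL in the unsupported /display/SPACE/Page+Title format yields None, so a script can reject it early instead of waiting for the indexing command to fail.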
Parameters
| Parameter | Environment variable | Default | Description |
|---|---|---|---|
| --page-url URL | — | — | Page URL. This is a required setting. You can specify more than one |
| --base-url URL | — | Automatically | Confluence base URL. Derived automatically from the first --page-url |
| --username TEXT | CONFLUENCE_USERNAME | — | User's email address. Required for local storages |
| --api-token TEXT | CONFLUENCE_API_TOKEN | — | API token. Required for local storages |
| --export-format TEXT | — | pdf | Export format: pdf, html, or markdown |
| --no-verify | — | false | Flag to disable the SSL certificate check |
Note
For cloud storages, use an email and API token. For local storages, use local credentials unless configured otherwise.
Use cases
# Public page without authentication
yandex-ai-studio vector-stores confluence \
--page-url "https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=34840000"
# Multiple pages
yandex-ai-studio vector-stores confluence \
--page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/111/Overview" \
--page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/222/Architecture"
# Authentication with environment variables
export CONFLUENCE_USERNAME=alice@example.com
export CONFLUENCE_API_TOKEN=ATATT3xFf********
yandex-ai-studio vector-stores confluence \
--page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Design"
# Export in HTML format
yandex-ai-studio vector-stores confluence \
--page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Design" \
--export-format html
The s3 data source creates a search index from the objects in an S3-compatible bucket.
Parameters
| Parameter | Environment variable | Description |
|---|---|---|
| --prefix TEXT | — | Filtering objects by folder path (prefix) |
| --endpoint-url URL | — | Custom S3 endpoint, e.g., https://storage.yandexcloud.net for Object Storage |
| --aws-access-key-id TEXT | AWS_ACCESS_KEY_ID | Key ID |
| --aws-secret-access-key TEXT | AWS_SECRET_ACCESS_KEY | Secret key |
| --region-name TEXT | AWS_DEFAULT_REGION | AWS region name |
| --include-pattern GLOB | — | Including object keys matching the pattern. You can specify more than one |
| --exclude-pattern GLOB | — | Excluding object keys matching the pattern. You can specify more than one |
| --max-file-size INT | — | Skip files larger than the specified size in bytes |
Note
If credentials are not provided, the tool will attempt to use the AWS CLI configuration, Yandex Cloud CLI configuration, or the Compute Cloud VM metadata.
Use cases
# Indexing the entire bucket
yandex-ai-studio vector-stores s3 <bucket_name>
# Indexing only a specific prefix
yandex-ai-studio vector-stores s3 <bucket_name> --prefix docs/
# Including PDF files only
yandex-ai-studio vector-stores s3 <bucket_name> --include-pattern "*.pdf"
# Using Object Storage
yandex-ai-studio vector-stores s3 <bucket_name> \
--endpoint-url https://storage.yandexcloud.net \
--region-name ru-central1
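Before indexing a large bucket, it can help to preview which object keys a combination of include and exclude patterns would select. The sketch below is an approximation under stated assumptions: it matches patterns against the full object key using Python's fnmatch rules, applies exclusions after inclusions, and treats an empty include list as "include everything". The tool's actual matching semantics may differ, and preview_keys is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def preview_keys(
    keys: Iterable[str],
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Return the keys that pass the include and exclude glob patterns.

    Assumption: fnmatch-style, case-sensitive matching against the full key,
    where "*" also matches "/" characters.
    """
    selected = []
    for key in keys:
        if include and not any(fnmatchcase(key, pattern) for pattern in include):
            continue
        if any(fnmatchcase(key, pattern) for pattern in exclude):
            continue
        selected.append(key)
    return selected
```

The key list itself would come from a bucket listing obtained separately, for example with an S3 client of your choice.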
The wiki data source creates a search index from pages residing in MediaWiki-based storages (e.g., Wikipedia).
Page URLs must be valid storage URLs with paths containing /wiki/. You can specify more than one URL.
For public storages, authentication is optional.
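The /wiki/ path requirement is easy to check up front. This is a minimal sketch; the restriction to http and https schemes is an assumption made for the example, and is_wiki_page_url is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_wiki_page_url(url: str) -> bool:
    """Check that a URL is absolute and that its path contains /wiki/."""
    parsed = urlparse(url)
    # Assumption: only http(s) URLs with a host name are considered valid.
    has_host = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return has_host and "/wiki/" in parsed.path
```

For instance, https://en.wikipedia.org/wiki/Machine_learning passes the check, while an index.php-style URL such as https://en.wikipedia.org/w/index.php?title=Machine_learning does not.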
Parameters
| Parameter | Environment variable | Default | Description |
|---|---|---|---|
| --username TEXT | WIKI_USERNAME | — | Username. Required for non-public storages |
| --password TEXT | WIKI_PASSWORD | — | User password |
| --export-format TEXT | — | text | Output format: text, html, or markdown |
Use cases
# Indexing a single Wikipedia page
yandex-ai-studio vector-stores wiki https://en.wikipedia.org/wiki/Machine_learning
# Indexing multiple pages
yandex-ai-studio vector-stores wiki \
https://en.wikipedia.org/wiki/Machine_learning \
https://en.wikipedia.org/wiki/Neural_network \
"https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)"
# Exporting content in Markdown format
yandex-ai-studio vector-stores wiki \
"https://en.wikipedia.org/wiki/Python_(programming_language)" \
--export-format markdown
# Accessing a private wiki using account credentials
yandex-ai-studio vector-stores wiki \
https://wiki.example.com/wiki/Internal_docs \
--username alice \
--password secret
Common parameters
The following parameters are available for all data source types.
Connection
| Parameter | Environment variable | Description |
|---|---|---|
| --folder-id TEXT | YC_FOLDER_ID | Yandex Cloud folder ID. Required |
| --auth TEXT | YC_API_KEY or YC_IAM_TOKEN | Authentication token |
| --endpoint URL | — | Overriding a standard API endpoint |
Index settings
| Option | Default | Description |
|---|---|---|
| --name TEXT | — | Name of the new search index |
| --metadata KEY=VALUE | — | Adding metadata. Up to 16 key=value pairs |
| --expires-after-days INT | — | Index lifetime in days |
| --expires-after-anchor TEXT | — | Starting point for index lifetime. The possible values are: created_at (creation date) and last_active_at (last activity) |
| --max-chunk-size-tokens INT | 800 | Maximum number of tokens per text chunk |
| --chunk-overlap-tokens INT | 400 | Number of overlapping tokens between adjacent chunks |
| --poll-timeout INT | 3600 | Maximum index creation timeout, in seconds |
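To get a feel for how the chunk size and overlap settings interact, the number of chunks produced for a document can be estimated. The sketch below assumes a plain sliding-window split, where each chunk starts max_chunk_size_tokens minus chunk_overlap_tokens tokens after the previous one; the service's actual chunking may differ, and estimate_chunks is a hypothetical helper.

```python
import math


def estimate_chunks(
    total_tokens: int,
    max_chunk_size_tokens: int = 800,
    chunk_overlap_tokens: int = 400,
) -> int:
    """Estimate the chunk count for a document under a sliding-window split."""
    if not 0 <= chunk_overlap_tokens < max_chunk_size_tokens:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= max_chunk_size_tokens:
        return 1
    # Each additional chunk contributes this many new tokens.
    stride = max_chunk_size_tokens - chunk_overlap_tokens
    return math.ceil((total_tokens - chunk_overlap_tokens) / stride)
```

With the default values, a 2,000-token document gives an estimate of four chunks, each sharing 400 tokens with its neighbor. Under this model, a larger overlap increases the number of chunks for the same document.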
Upload settings
| Option | Default | Description |
|---|---|---|
| --max-concurrent-uploads INT | 4 | Maximum number of concurrent file uploads |
| --skip-on-error | false | Continue processing on file upload error |
| --file-expires-after-seconds INT | — | Lifetime of uploaded files, in seconds |
| --file-expires-after-anchor TEXT | — | Starting point for file lifetime. The possible values are: created_at (creation date) and last_active_at (last activity) |
Output settings
| Option | Default | Description |
|---|---|---|
| -v | — | Logging level: INFO |
| -vv | — | Logging level: DEBUG |
| --format TEXT | text | Output format: text or json |
Output
If successful, the command outputs the index ID and name.
Text output example:
Search index created successfully!
Search Index ID: fvt-hj87lxe3********
Name: my-index
JSON output example:
{
  "status": "success",
  "folder_id": "b1go3el0d8fs********",
  "search_index": {
    "id": "fvt-hj87lxe3********",
    "name": "my-index"
  }
}
Note
By default, error messages are returned in standard output streams. If --format json is used, error messages are returned in a structured JSON format.
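When the command is run with --format json, a script can read the index ID directly from the output. This is a minimal sketch based on the success example above; the structure of error responses is not documented here, so the code only assumes that a failure carries a status other than "success". parse_index_result is a hypothetical helper.

```python
import json
from typing import Dict


def parse_index_result(stdout: str) -> Dict[str, str]:
    """Extract the search index ID and name from the CLI's JSON output."""
    payload = json.loads(stdout)
    # Assumption: any status other than "success" indicates a failure.
    if payload.get("status") != "success":
        raise RuntimeError(f"index creation did not succeed: {payload}")
    index = payload["search_index"]
    return {"id": index["id"], "name": index["name"]}
```

The input string would typically be the captured standard output of the command, for example the stdout attribute of the result returned by subprocess.run with capture_output=True and text=True.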