Connectors for creating Vector Store search indexes

You can create Vector Store search indexes from files stored in various sources. For this purpose, Yandex Cloud ML SDK provides a command-line utility, Vector Store CLI, which supports the following data sources:

Local file system.
Atlassian Confluence.
Yandex Object Storage and other S3-compatible storage.
MediaWiki-based systems.

Installation

Install the base package and optional extra features depending on the data sources you plan to work with:

    # Basic installation with support for local files and Atlassian Confluence
    pip install yandex-ai-studio-sdk

    # With support for Amazon S3 and Object Storage
    pip install "yandex-ai-studio-sdk[cli-s3]"

    # With support for MediaWiki
    pip install "yandex-ai-studio-sdk[cli-wiki]"

    # All optional features
    pip install "yandex-ai-studio-sdk[cli-wiki,cli-s3]"

Authentication

The Vector Store CLI supports the following authentication methods:

The --auth flag with an IAM token or an API key. The flag is optional if you authenticate by another method.
The YC_API_KEY or YC_IAM_TOKEN environment variable.
The Yandex Cloud CLI configuration. If the Yandex Cloud CLI is installed and configured, the tool uses its configuration automatically.
Compute Cloud VM metadata. When running on a Compute Cloud VM, the tool can use the VM metadata to authenticate automatically.

Note

If both environment variables are set, YC_API_KEY takes precedence over YC_IAM_TOKEN.
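
The precedence between the two environment variables can be pictured with a short shell sketch. The `pick_auth_source` function is hypothetical: it only mirrors the documented rule for YC_API_KEY and YC_IAM_TOKEN, is not the tool's actual implementation, and ignores the --auth flag, the Yandex Cloud CLI configuration, and VM metadata.

```shell
# Hypothetical helper, not part of the Vector Store CLI: report which
# environment variable would supply the credential, following the
# documented rule that YC_API_KEY takes precedence over YC_IAM_TOKEN.
pick_auth_source() {
  if [ -n "${YC_API_KEY:-}" ]; then
    echo "YC_API_KEY"
  elif [ -n "${YC_IAM_TOKEN:-}" ]; then
    echo "YC_IAM_TOKEN"
  else
    echo "none"
  fi
}

# Both variables set (placeholder values) inside a subshell, so the
# calling shell's environment is left untouched.
( YC_API_KEY="<api_key>"; YC_IAM_TOKEN="<iam_token>"; pick_auth_source )
```

To avoid surprises, it may be simplest to set only one of the two variables in any given shell session.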

Usage

You can work with the Vector Store CLI using commands formatted as follows:

yandex-ai-studio vector-stores <data_source_type> [<parameters>] [<data_path>]

Where:

<data_source_type>: Data source name. The possible values are:
- local: Local file system.
- confluence: Atlassian Confluence.
- s3: Object Storage and other S3-compatible storage.
- wiki: MediaWiki-based systems.
<parameters>: Authentication data and additional indexing parameters.
<data_path>: Path to the data for indexing.
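
As a rough illustration of the command format, the sketch below assembles a command line from its three parts and prints it instead of running it. The `vs_command` wrapper is a hypothetical name introduced here; only the four data source types and the `yandex-ai-studio vector-stores` prefix come from this page.

```shell
# Hypothetical dry-run wrapper: validate the data source type, then
# print the command that would be executed. Arguments are joined with
# spaces for display only, so the printout is not safe to re-evaluate.
vs_command() {
  if [ "$#" -eq 0 ]; then
    echo "usage: vs_command <data_source_type> [args...]" >&2
    return 1
  fi
  source_type=$1
  shift
  case "$source_type" in
    local|confluence|s3|wiki) ;;
    *) echo "unknown data source type: $source_type" >&2; return 1 ;;
  esac
  echo yandex-ai-studio vector-stores "$source_type" "$@"
}

vs_command local --max-file-size 1048576 report.pdf
```

Removing the `echo` in the last line of the function would turn the dry run into a real invocation.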

The way you work with the utility depends on your data source.

Local files

The local data source does not support folders. Specify full file paths or use shell glob patterns; for example, *.txt matches all files with the .txt extension.
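
Because folders are not accepted, one possible workaround is to let `find` expand a directory tree into individual file paths. The sketch below is a dry run: the `index_dir_dry_run` helper is hypothetical and only prints the command it would run. The choice of the .txt and .md extensions is an assumption made for illustration.

```shell
# Hypothetical helper: print the command that would index every .txt
# and .md file under the given directory, including subdirectories.
# `-exec ... {} +` passes the files as separate arguments, so names
# containing spaces reach the command intact.
index_dir_dry_run() {
  find "$1" -type f \( -name '*.txt' -o -name '*.md' \) \
    -exec echo yandex-ai-studio vector-stores local {} +
}

# Usage (the directory must exist):
#   index_dir_dry_run sample_docs
```

Dropping the `echo` runs the command for real. For very large trees, `find` may split the file list across several invocations, and each invocation would then create its own index.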

Parameters

Parameter	Description
`--max-file-size INT`	Skip files larger than the specified size in bytes
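
Since the limit is given in bytes, a small conversion helper can make the value easier to read. The `mib_to_bytes` function below is a hypothetical convenience, not part of the CLI.

```shell
# Hypothetical helper: convert mebibytes to bytes for --max-file-size.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 10   # prints 10485760

# Possible use with the documented flag:
#   yandex-ai-studio vector-stores local \
#     --max-file-size "$(mib_to_bytes 10)" sample_docs/*.pdf
```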

Use cases

    # Indexing a single file
    yandex-ai-studio vector-stores local report.pdf

    # Indexing multiple files
    yandex-ai-studio vector-stores local docs/intro.txt docs/guide.md

    # Using shell glob patterns to include all `.txt` and `.md` files
    yandex-ai-studio vector-stores local sample_docs/*.txt sample_docs/*.md

Atlassian Confluence

The confluence data source creates a search index from Atlassian Confluence pages specified by URL. Each URL must contain the page ID. For example:

Confluence Cloud: https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title
Self-hosted (local) Confluence: https://confluence.example.com/pages/viewpage.action?pageId=123456

Warning

URLs in /display/SPACE/Page+Title format are not supported.

To find out the page ID:

Confluence Cloud. Take the number that follows /pages/ in the page URL.
Self-hosted (local) Confluence. With the page open, open the page menu and select Page information. The ID appears in the pageId parameter of the resulting URL.
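
The two supported URL shapes can also be handled in a script. The sketch below is a hypothetical helper, `confluence_page_id`, that pulls the numeric ID out of either form using `sed`; it is an illustration only and says nothing about how the CLI itself parses URLs.

```shell
# Hypothetical helper: extract the numeric page ID from a Confluence
# page URL. Handles ...?pageId=<id> and .../pages/<id>/...; returns a
# non-zero status for other shapes, such as /display/SPACE/Page+Title.
confluence_page_id() {
  case "$1" in
    *pageId=*)
      printf '%s\n' "$1" | sed -n 's/.*[?&]pageId=\([0-9][0-9]*\).*/\1/p' ;;
    */pages/*)
      printf '%s\n' "$1" | sed -n 's#.*/pages/\([0-9][0-9]*\).*#\1#p' ;;
    *)
      return 1 ;;
  esac
}

confluence_page_id "https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title"
confluence_page_id "https://confluence.example.com/pages/viewpage.action?pageId=123456"
```

Both calls print 123456. A check like this can be used to reject unsupported URLs before they are passed to --page-url.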

Parameters

Parameter	Environment variable	Default	Description
`--page-url URL`	—	—	Page URL. This is a required setting. You can specify more than one
`--base-url URL`	—	Automatically	Confluence base URL. Derived automatically from the first `--page-url`
`--username TEXT`	`CONFLUENCE_USERNAME`	—	Username; for Confluence Cloud, the account's email address. Required for non-public instances
`--api-token TEXT`	`CONFLUENCE_API_TOKEN`	—	API token. Required for non-public instances
`--export-format TEXT`	—	`pdf`	Export format: `pdf`, `html`, or `markdown`
`--no-verify`	—	`false`	Flag to disable the SSL certificate check

Note

For Confluence Cloud, use an email address and an API token. For self-hosted (local) Confluence, use local account credentials unless configured otherwise.

Use cases

    # Public instance without authentication
    yandex-ai-studio vector-stores confluence \
      --page-url "https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=34840000"

    # Multiple pages
    yandex-ai-studio vector-stores confluence \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/111/Overview" \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/222/Architecture"

    # Authenticating with environment variables
    export CONFLUENCE_USERNAME=alice@example.com
    export CONFLUENCE_API_TOKEN=ATATT3xFf********
    yandex-ai-studio vector-stores confluence \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Design"

    # Export in HTML format
    yandex-ai-studio vector-stores confluence \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Design" \
      --export-format html

S3-compatible storage

The s3 data source creates a search index from the objects in an S3-compatible bucket.

Parameters

Parameter	Environment variable	Description
`--prefix TEXT`	—	Filtering objects by folder path (prefix)
`--endpoint-url URL`	—	Custom S3 endpoint, e.g., `https://storage.yandexcloud.net` for Object Storage
`--aws-access-key-id TEXT`	`AWS_ACCESS_KEY_ID`	Key ID
`--aws-secret-access-key TEXT`	`AWS_SECRET_ACCESS_KEY`	Secret key
`--region-name TEXT`	`AWS_DEFAULT_REGION`	AWS region name
`--include-pattern GLOB`	—	Including object keys matching the pattern. You can specify more than one
`--exclude-pattern GLOB`	—	Excluding object keys matching the pattern. You can specify more than one
`--max-file-size INT`	—	Skip files larger than the specified size in bytes
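
To get a feel for how include and exclude patterns might select object keys, the sketch below models the filtering with shell `case` matching. It rests on several assumptions not stated on this page: that exclusion wins over inclusion, that `*` also matches `/`, and that only one pattern of each kind is given. The `key_selected` function is hypothetical, and the tool's real matching rules may differ.

```shell
# Hypothetical model of include/exclude filtering for a single key.
# Assumption: an exclude match rejects the key even if it also matches
# the include pattern. Pass '' to mean "no exclude pattern".
key_selected() {
  key=$1 include=$2 exclude=$3
  case "$key" in
    $exclude) return 1 ;;
  esac
  case "$key" in
    $include) return 0 ;;
  esac
  return 1
}

for k in docs/guide.pdf docs/notes.txt drafts/old.pdf; do
  if key_selected "$k" '*.pdf' 'drafts/*'; then
    echo "index: $k"
  else
    echo "skip:  $k"
  fi
done
```

Under these assumptions, only docs/guide.pdf is selected. It is worth verifying the result on a small prefix before indexing a whole bucket.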

Note

If credentials are not provided, the tool will attempt to use the AWS CLI configuration, Yandex Cloud CLI configuration, or the Compute Cloud VM metadata.

Use cases

    # Indexing the entire bucket
    yandex-ai-studio vector-stores s3 <bucket_name>

    # Indexing only a specific prefix
    yandex-ai-studio vector-stores s3 <bucket_name> --prefix docs/

    # Including PDF files only
    yandex-ai-studio vector-stores s3 <bucket_name> --include-pattern "*.pdf"

    # Using Object Storage
    yandex-ai-studio vector-stores s3 <bucket_name> \
      --endpoint-url https://storage.yandexcloud.net \
      --region-name ru-central1

MediaWiki

The wiki data source creates a search index from pages on MediaWiki-based sites (e.g., Wikipedia).

Page URLs must be valid URLs of the wiki site, with paths containing /wiki/. You can specify more than one URL.
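
A quick pre-check can catch URLs that lack the /wiki/ path segment before they reach the tool. The `is_wiki_url` function below is a hypothetical sketch; it assumes an http or https scheme and a non-empty page title after /wiki/, which is stricter than what this page states.

```shell
# Hypothetical pre-check: succeed only for http(s) URLs whose path
# contains /wiki/ followed by at least one character.
is_wiki_url() {
  case "$1" in
    http://*/wiki/?*|https://*/wiki/?*) return 0 ;;
    *) return 1 ;;
  esac
}

for url in \
  "https://en.wikipedia.org/wiki/Machine_learning" \
  "https://en.wikipedia.org/w/index.php?title=Machine_learning"
do
  if is_wiki_url "$url"; then
    echo "ok:      $url"
  else
    echo "skipped: $url"
  fi
done
```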

For public wikis, authentication is optional.

Parameters

Parameter	Environment variable	Default	Description
`--username TEXT`	`WIKI_USERNAME`	—	Username. Required for non-public wikis
`--password TEXT`	`WIKI_PASSWORD`	—	User password
`--export-format TEXT`	—	`text`	Output format: `text`, `html`, or `markdown`

Use cases

    # Indexing a single Wikipedia page
    yandex-ai-studio vector-stores wiki https://en.wikipedia.org/wiki/Machine_learning

    # Indexing multiple pages (URLs with parentheses must be quoted)
    yandex-ai-studio vector-stores wiki \
      "https://en.wikipedia.org/wiki/Machine_learning" \
      "https://en.wikipedia.org/wiki/Neural_network" \
      "https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)"

    # Exporting content in Markdown format
    yandex-ai-studio vector-stores wiki \
      "https://en.wikipedia.org/wiki/Python_(programming_language)" \
      --export-format markdown

    # Accessing a private wiki using account credentials
    yandex-ai-studio vector-stores wiki \
      https://wiki.example.com/wiki/Internal_docs \
      --username alice \
      --password secret

Common parameters

Common parameters are available for all data source types.

Connection

Parameter	Environment variable	Description
`--folder-id TEXT`	`YC_FOLDER_ID`	Yandex Cloud folder ID. Required
`--auth TEXT`	`YC_API_KEY` or `YC_IAM_TOKEN`	Authentication token
`--endpoint URL`	—	Overriding a standard API endpoint

Index settings

Option	Default	Description
`--name TEXT`	—	Name of the new search index
`--metadata KEY=VALUE`	—	Adding metadata. Up to 16 `key=value` pairs
`--expires-after-days INT`	—	Index lifetime in days
`--expires-after-anchor TEXT`	—	Starting point for the index lifetime. Possible values: `created_at` (creation date) or `last_active_at` (last activity)
`--max-chunk-size-tokens INT`	`800`	Maximum number of tokens per text chunk
`--chunk-overlap-tokens INT`	`400`	Number of overlapping tokens between adjacent chunks
`--poll-timeout INT`	`3600`	Maximum time to wait for index creation, in seconds
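
The effect of --max-chunk-size-tokens and --chunk-overlap-tokens on index size can be estimated with simple arithmetic. The sketch below assumes a plain sliding-window model in which each fragment starts (size minus overlap) tokens after the previous one; this model and the `estimate_chunks` helper are assumptions made for illustration, not the service's documented algorithm.

```shell
# Hypothetical estimate of how many fragments a document yields under a
# sliding-window model. Arguments: token count, then optionally the
# maximum fragment size and the overlap (defaults follow the table).
estimate_chunks() {
  tokens=$1
  size=${2:-800}
  overlap=${3:-400}
  stride=$(( size - overlap ))
  if [ "$stride" -le 0 ]; then
    echo "overlap must be smaller than the fragment size" >&2
    return 1
  fi
  if [ "$tokens" -le "$size" ]; then
    echo 1
  else
    # Ceiling of (tokens - overlap) / stride in integer arithmetic.
    echo $(( (tokens - overlap + stride - 1) / stride ))
  fi
}

estimate_chunks 2000   # prints 4
```

With the defaults, a 2000-token document maps to windows starting at tokens 0, 400, 800, and 1200. A larger overlap repeats more context between neighbors at the cost of more fragments.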

Upload settings

Option	Default	Description
`--max-concurrent-uploads INT`	`4`	Maximum number of concurrent file uploads
`--skip-on-error`	`false`	Continue processing on file upload error
`--file-expires-after-seconds INT`	—	Lifetime of uploaded files, in seconds
`--file-expires-after-anchor TEXT`	—	Starting point for the file lifetime. Possible values: `created_at` (creation date) or `last_active_at` (last activity)

Output settings

Option	Default	Description
`-v`	—	Logging level: `INFO`
`-vv`	—	Logging level: `DEBUG`
`--format TEXT`	`text`	Output format: `text` or `json`

Output

If successful, the command outputs the index ID and name.

Text output example:

    Search index created successfully!
    Search Index ID: fvt-hj87lxe3********
    Name: my-index

JSON output example:

    {
      "status": "success",
      "folder_id": "b1go3el0d8fs********",
      "search_index": {
        "id": "fvt-hj87lxe3********",
        "name": "my-index"
      }
    }
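
In scripts, the JSON output makes it possible to capture the index ID for later steps. The sketch below extracts the `id` field with `sed` from output shaped like the example above. The `extract_index_id` helper is hypothetical, and line-based parsing like this is fragile; a JSON-aware tool such as jq is a safer choice when available.

```shell
# Hypothetical helper: read JSON on standard input and print the value
# of the "id" field. Relies on the field sitting on its own line, as in
# the sample output; "folder_id" does not match the pattern.
extract_index_id() {
  sed -n 's/.*"id": *"\([^"]*\)".*/\1/p'
}

# Demonstration on the sample output from this page.
cat <<'EOF' | extract_index_id
{
  "status": "success",
  "folder_id": "b1go3el0d8fs********",
  "search_index": {
    "id": "fvt-hj87lxe3********",
    "name": "my-index"
  }
}
EOF
```

A real pipeline might look like `yandex-ai-studio vector-stores local report.pdf --format json | extract_index_id`.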

Note

By default, error messages are written to the standard output streams as plain text. With --format json, they are returned as structured JSON.
