Connectors for creating Vector Store search indexes

You can create Vector Store search indexes from files in various sources. For this purpose, Yandex Cloud ML SDK provides a command-line utility, Vector Store CLI, which supports the following data sources: the local file system, Atlassian Confluence, Object Storage and other S3-compatible storage, and MediaWiki-based systems.

Installation

Install the base package and, depending on the data sources you plan to work with, the optional extras:

    # Basic installation with support for local files and Atlassian Confluence
    pip install yandex-ai-studio-sdk

    # With support for Amazon S3 and Object Storage
    pip install "yandex-ai-studio-sdk[cli-s3]"

    # With support for MediaWiki
    pip install "yandex-ai-studio-sdk[cli-wiki]"

    # All optional features
    pip install "yandex-ai-studio-sdk[cli-wiki,cli-s3]"

Authentication

The Vector Store CLI supports the following authentication methods:

  1. The --auth flag, or the YC_API_KEY or YC_IAM_TOKEN environment variable, with an API key or an IAM token. The flag is not required if you authenticate by another method.

    Note

    If both environment variables are set, YC_API_KEY takes precedence over YC_IAM_TOKEN.

  2. If you have the Yandex Cloud CLI installed and configured, the tool automatically uses its configuration.

  3. When running on a Compute Cloud VM, the tool can use the VM metadata for automatic authentication.
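
As a rough illustration of the precedence stated in the note, the following shell function reports which environment variable a wrapper script would expect the tool to pick up. It only mirrors the documented rule and is not the tool's implementation; the function name auth_source is made up for this example.

```shell
# Hypothetical helper: reports which environment variable would supply the
# credentials, following the documented rule that YC_API_KEY wins over
# YC_IAM_TOKEN. It never prints the secret itself.
auth_source() {
  if [ -n "${YC_API_KEY:-}" ]; then
    echo "YC_API_KEY"
  elif [ -n "${YC_IAM_TOKEN:-}" ]; then
    echo "YC_IAM_TOKEN"
  else
    # Neither variable is set: the tool would have to rely on --auth,
    # the Yandex Cloud CLI configuration, or VM metadata.
    echo "none"
  fi
}

auth_source
```

A check like this can be useful at the top of a CI script to fail early with a clear message instead of waiting for an authentication error.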

Usage

You can work with the Vector Store CLI using commands formatted as follows:

    yandex-ai-studio vector-stores <data_source_type> [<parameters>] [<data_path>]

Where:

  • <data_source_type>: Data source name. The possible values are:

    • local: Local file system.
    • confluence: Atlassian Confluence.
    • s3: Object Storage and other S3-compatible storages.
    • wiki: MediaWiki-based systems.
  • <parameters>: Authentication data and additional indexing parameters.

  • <data_path>: Path to the data for indexing.

How you use the utility depends on the data source.

The local data source does not support folders. Use full file paths or shell globbing, e.g., *.txt matches all files with the .txt extension.
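
Because folders are not accepted, a directory tree has to be expanded into individual file paths before it reaches the utility. One possible way to do that with find is sketched below. The helper name list_docs and the sample_docs directory are placeholders, and the final command is only printed, not executed.

```shell
# Hypothetical helper: prints every .txt and .md file under a directory,
# including subdirectories, one path per line in sorted order.
list_docs() {
  find "$1" -type f \( -name '*.txt' -o -name '*.md' \) | sort
}

# Dry run: print the command that would index the whole tree.
# Assumes paths contain no whitespace; remove `echo` to execute.
if [ -d sample_docs ]; then
  echo yandex-ai-studio vector-stores local $(list_docs sample_docs)
fi
```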

Parameters

| Parameter | Description |
|---|---|
| --max-file-size INT | Skip files larger than the specified size, in bytes |

Use cases

    # Indexing a single file
    yandex-ai-studio vector-stores local report.pdf

    # Indexing multiple files
    yandex-ai-studio vector-stores local docs/intro.txt docs/guide.md

    # Using shell globbing to include all `.txt` and `.md` files
    yandex-ai-studio vector-stores local sample_docs/*.txt sample_docs/*.md

The confluence data source creates a search index based on the URLs of Atlassian Confluence pages. The URL must contain the page ID. For example:

  • Cloud storage: https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title
  • Local storage: https://confluence.example.com/pages/viewpage.action?pageId=123456

Warning

URLs in /display/SPACE/Page+Title format are not supported.

To find out the page ID:

  • Cloud storage. Take the number that follows /pages/ in the URL.
  • Local storage. With the page open, open the page actions menu and select Page information. The ID appears in the ?pageId= parameter of the URL.
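
The two URL shapes above can also be handled in a script. The following is a minimal sketch that extracts the page ID from either shape; the function name confluence_page_id is hypothetical, and the patterns cover only the two URL formats shown in this section.

```shell
# Hypothetical helper: prints the numeric page ID found in a Confluence URL,
# or returns a non-zero status for URL shapes that carry no ID.
confluence_page_id() {
  case "$1" in
    *pageId=*)
      # .../pages/viewpage.action?pageId=123456
      printf '%s\n' "$1" | sed -n 's/.*[?&]pageId=\([0-9][0-9]*\).*/\1/p'
      ;;
    */pages/[0-9]*)
      # .../wiki/spaces/SPACE/pages/123456/Page+Title
      printf '%s\n' "$1" | sed -n 's#.*/pages/\([0-9][0-9]*\).*#\1#p'
      ;;
    *)
      # For example, /display/SPACE/Page+Title has no ID to extract.
      return 1
      ;;
  esac
}

confluence_page_id "https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/123456/Page+Title"
confluence_page_id "https://confluence.example.com/pages/viewpage.action?pageId=123456"
```

Running the same function over a list of links before indexing is one way to spot unsupported /display/ URLs early.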

Parameters

| Parameter | Environment variable | Default | Description |
|---|---|---|---|
| --page-url URL | | | Page URL. Required. You can specify more than one |
| --base-url URL | | Derived automatically | Confluence base URL. By default, derived from the first --page-url |
| --username TEXT | CONFLUENCE_USERNAME | | User's email address. Required for local storages |
| --api-token TEXT | CONFLUENCE_API_TOKEN | | API token. Required for local storages |
| --export-format TEXT | | pdf | Export format: pdf, html, or markdown |
| --no-verify | | false | Flag to disable the SSL certificate check |

Note

For cloud storages, use an email and API token. For local storages, use local credentials unless configured otherwise.

Use cases

    # Public page without authentication
    yandex-ai-studio vector-stores confluence \
      --page-url "https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=34840000"

    # Multiple pages
    yandex-ai-studio vector-stores confluence \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/111/Overview" \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/222/Architecture"

    # Cloud storage with credentials passed in environment variables
    export CONFLUENCE_USERNAME=alice@example.com
    export CONFLUENCE_API_TOKEN=ATATT3xFf********
    yandex-ai-studio vector-stores confluence \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Design"

    # Export in HTML format
    yandex-ai-studio vector-stores confluence \
      --page-url "https://your-domain.atlassian.net/wiki/spaces/ENG/pages/123456/Design" \
      --export-format html

The s3 data source creates a search index from the objects in an S3-compatible bucket.

Parameters

| Parameter | Environment variable | Description |
|---|---|---|
| --prefix TEXT | | Filter objects by folder path (prefix) |
| --endpoint-url URL | | Custom S3 endpoint, e.g., https://storage.yandexcloud.net for Object Storage |
| --aws-access-key-id TEXT | AWS_ACCESS_KEY_ID | Key ID |
| --aws-secret-access-key TEXT | AWS_SECRET_ACCESS_KEY | Secret key |
| --region-name TEXT | AWS_DEFAULT_REGION | AWS region name |
| --include-pattern GLOB | | Include object keys matching the pattern. You can specify more than one |
| --exclude-pattern GLOB | | Exclude object keys matching the pattern. You can specify more than one |
| --max-file-size INT | | Skip files larger than the specified size, in bytes |

Note

If credentials are not provided, the tool will attempt to use the AWS CLI configuration, Yandex Cloud CLI configuration, or the Compute Cloud VM metadata.
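
Since --max-file-size is given in bytes, it can help to compute the value rather than type it out. The sketch below combines it with the filtering parameters from the table. It assumes these flags can be used together in one call, the bucket name and patterns are placeholders, and how exactly patterns are matched against object keys is not specified here. The command is only printed.

```shell
# --max-file-size takes bytes, so convert 10 MiB explicitly.
max_bytes=$((10 * 1024 * 1024))

# Dry run: print the command rather than run it; remove `echo` to execute.
# The bucket name and patterns are placeholders.
echo yandex-ai-studio vector-stores s3 my-bucket \
  --prefix docs/ \
  --include-pattern '*.pdf' \
  --exclude-pattern '*draft*' \
  --max-file-size "$max_bytes"
```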

Use cases

    # Indexing the entire bucket
    yandex-ai-studio vector-stores s3 <bucket_name>

    # Indexing only a specific prefix
    yandex-ai-studio vector-stores s3 <bucket_name> --prefix docs/

    # Including PDF files only
    yandex-ai-studio vector-stores s3 <bucket_name> --include-pattern "*.pdf"

    # Using Object Storage
    yandex-ai-studio vector-stores s3 <bucket_name> \
      --endpoint-url https://storage.yandexcloud.net \
      --region-name ru-central1

The wiki data source creates a search index from pages residing in MediaWiki-based storages (e.g., Wikipedia).

Page URLs must be valid storage URLs with paths containing /wiki/. You can specify more than one URL.

For public storages, authentication is optional.
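
A script that collects page links from elsewhere may want to filter out URLs that do not have the /wiki/ path before calling the utility. The following pre-flight check is a minimal sketch of that idea; the function name is_wiki_url is hypothetical, and the tool's own validation may be stricter or looser.

```shell
# Hypothetical pre-flight check: succeeds only for http(s) URLs whose path
# contains /wiki/ followed by a non-empty page title.
is_wiki_url() {
  case "$1" in
    http://*/wiki/?* | https://*/wiki/?*) return 0 ;;
    *) return 1 ;;
  esac
}

for url in \
  "https://en.wikipedia.org/wiki/Machine_learning" \
  "https://en.wikipedia.org/w/index.php?title=Machine_learning"
do
  if is_wiki_url "$url"; then
    echo "ok:   $url"
  else
    echo "skip: $url"
  fi
done
```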

Parameters

| Parameter | Environment variable | Default | Description |
|---|---|---|---|
| --username TEXT | WIKI_USERNAME | | Username. Required for non-public storages |
| --password TEXT | WIKI_PASSWORD | | User password |
| --export-format TEXT | | text | Output format: text, html, or markdown |

Use cases

    # Indexing a single Wikipedia page
    yandex-ai-studio vector-stores wiki https://en.wikipedia.org/wiki/Machine_learning

    # Indexing multiple pages (quote URLs that contain parentheses)
    yandex-ai-studio vector-stores wiki \
      "https://en.wikipedia.org/wiki/Machine_learning" \
      "https://en.wikipedia.org/wiki/Neural_network" \
      "https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)"

    # Exporting content in Markdown format
    yandex-ai-studio vector-stores wiki \
      "https://en.wikipedia.org/wiki/Python_(programming_language)" \
      --export-format markdown

    # Accessing a private wiki using account credentials
    yandex-ai-studio vector-stores wiki \
      https://wiki.example.com/wiki/Internal_docs \
      --username alice \
      --password secret

Common parameters

The following parameters are available for all data source types.

Connection

| Parameter | Environment variable | Description |
|---|---|---|
| --folder-id TEXT | YC_FOLDER_ID | Yandex Cloud folder ID. Required |
| --auth TEXT | YC_API_KEY or YC_IAM_TOKEN | Authentication token |
| --endpoint URL | | Overrides the standard API endpoint |

Index settings

| Option | Default | Description |
|---|---|---|
| --name TEXT | | Name of the new search index |
| --metadata KEY=VALUE | | Adds metadata. Up to 16 key=value pairs |
| --expires-after-days INT | | Index lifetime, in days |
| --expires-after-anchor TEXT | | Starting point for the index lifetime. Possible values: created_at (creation date), last_active_at (last activity) |
| --max-chunk-size-tokens INT | 800 | Maximum number of tokens per text chunk |
| --chunk-overlap-tokens INT | 400 | Number of tokens shared by adjacent chunks |
| --poll-timeout INT | 3600 | Maximum time to wait for index creation, in seconds |
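
To get a feel for how the chunk size and overlap settings interact, the sketch below estimates the number of chunks for a document of a given token count. It assumes a plain sliding window in which each chunk starts (chunk size minus overlap) tokens after the previous one; the service's actual splitting logic is not described here and may differ. The function name estimate_chunks is made up for this example.

```shell
# Rough estimate of how many chunks a document of N tokens yields, assuming
# a plain sliding window: each chunk holds up to `size` tokens and starts
# `size - overlap` tokens after the previous one.
estimate_chunks() {
  tokens=$1
  size=${2:-800}      # --max-chunk-size-tokens default
  overlap=${3:-400}   # --chunk-overlap-tokens default
  if [ "$overlap" -ge "$size" ]; then
    echo "overlap must be smaller than chunk size" >&2
    return 1
  fi
  if [ "$tokens" -le "$size" ]; then
    echo 1
    return 0
  fi
  stride=$((size - overlap))
  # Integer form of ceil((tokens - overlap) / stride).
  echo $(( (tokens - overlap + stride - 1) / stride ))
}

estimate_chunks 2000          # defaults: 800-token chunks, 400 overlap
estimate_chunks 2000 800 100  # smaller overlap, fewer chunks
```

Under this model, a 2000-token document yields 4 chunks with the defaults and 3 chunks when the overlap is reduced to 100, which shows how a large overlap increases the number of chunks stored per document.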

Upload settings

| Option | Default | Description |
|---|---|---|
| --max-concurrent-uploads INT | 4 | Maximum number of concurrent file uploads |
| --skip-on-error | false | Continue processing on file upload error |
| --file-expires-after-seconds INT | | Lifetime of uploaded files, in seconds |
| --file-expires-after-anchor TEXT | | Starting point for the file lifetime. Possible values: created_at (creation date), last_active_at (last activity) |
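
Note that the index lifetime is set in days while the file lifetime is set in seconds. If you want both to expire on the same schedule, converting explicitly avoids unit mistakes. The sketch below assumes the four lifetime options can be combined in one call; the file name is a placeholder and the command is only printed.

```shell
# Keep the index and its uploaded files for the same period.
days=30
file_ttl_seconds=$((days * 24 * 60 * 60))

# Dry run: print the command rather than run it; remove `echo` to execute.
echo yandex-ai-studio vector-stores local report.pdf \
  --expires-after-days "$days" \
  --expires-after-anchor last_active_at \
  --file-expires-after-seconds "$file_ttl_seconds" \
  --file-expires-after-anchor last_active_at
```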

Output settings

| Option | Default | Description |
|---|---|---|
| -v | | Logging level: INFO |
| -vv | | Logging level: DEBUG |
| --format TEXT | text | Output format: text or json |

Output

If successful, the command outputs the index ID and name.

Text output example:

    Search index created successfully!
    Search Index ID: fvt-hj87lxe3********
    Name: my-index

JSON output example:

    {
      "status": "success",
      "folder_id": "b1go3el0d8fs********",
      "search_index": {
        "id": "fvt-hj87lxe3********",
        "name": "my-index"
      }
    }

Note

By default, error messages are written to the standard output streams as plain text. With --format json, they are returned as structured JSON.
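
The JSON format is convenient when the index ID has to be passed to a later step of a script. The following sketch uses jq, an external tool that is not part of the SDK, to pull the ID out of a payload shaped like the JSON output example. The sample string stands in for the real command output, and the shape of an error payload is not documented here, so the failure branch simply treats anything other than a success status as an error.

```shell
# Sample payload copied from the JSON output example; in practice this would
# be the stdout of a command run with --format json.
result='{"status":"success","folder_id":"b1go3el0d8fs********","search_index":{"id":"fvt-hj87lxe3********","name":"my-index"}}'

# jq -e exits with a non-zero status when the filter yields nothing, null,
# or false, so the else branch also covers a missing search_index.id.
if index_id=$(printf '%s' "$result" | jq -er 'select(.status == "success") | .search_index.id'); then
  echo "Created index: $index_id"
else
  echo "Index creation failed" >&2
fi
```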
