Configuring cloud storage

When using LanceDB OSS, you can choose where to store your data. The tradeoffs between different storage options are discussed in the storage concepts guide. This guide shows how to configure LanceDB to use different storage options.

Object Stores

LanceDB OSS supports object stores such as AWS S3 (and compatible stores), Azure Blob Storage, and Google Cloud Storage. The object store is selected by the URI scheme of the dataset path: s3:// is used for AWS S3, az:// for Azure Blob Storage, and gs:// for Google Cloud Storage. These URIs are passed to the connect function:

AWS S3:

import lancedb
db = lancedb.connect("s3://bucket/path")

import lancedb
async_db = await lancedb.connect_async("s3://bucket/path")

Google Cloud Storage:

import lancedb
db = lancedb.connect("gs://bucket/path")

import lancedb
async_db = await lancedb.connect_async("gs://bucket/path")

Azure Blob Storage:

import lancedb
db = lancedb.connect("az://bucket/path")

import lancedb
async_db = await lancedb.connect_async("az://bucket/path")

Note that for Azure, storage credentials must be configured. See below for more details.

AWS S3:

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("s3://bucket/path");

Google Cloud Storage:

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("gs://bucket/path");

Azure Blob Storage:

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("az://bucket/path");

AWS S3:

const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path");

Google Cloud Storage:

const lancedb = require("lancedb");
const db = await lancedb.connect("gs://bucket/path");

Azure Blob Storage:

const lancedb = require("lancedb");
const db = await lancedb.connect("az://bucket/path");

In most cases, when running in the respective cloud and permissions are set up correctly, no additional configuration is required. When running outside of the respective cloud, authentication credentials must be provided. Credentials and other configuration options can be set in two ways: by setting environment variables, or by passing a storage_options object to the connect function. For example, to increase the request timeout to 60 seconds, you can set the TIMEOUT environment variable to 60s:

export TIMEOUT=60s

If you only want this to apply to one particular connection, you can pass the storage_options argument when opening the connection:

import lancedb
db = lancedb.connect("s3://bucket/path", storage_options={"timeout": "60s"})

import lancedb
async_db = await lancedb.connect_async("s3://bucket/path", storage_options={"timeout": "60s"})

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("s3://bucket/path", {
  storageOptions: { timeout: "60s" },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path", {
  storageOptions: { timeout: "60s" },
});

Getting even more specific, you can set the timeout for only a particular table:

import lancedb
db = lancedb.connect("s3://bucket/path")
table = db.create_table(
    "table",
    [{"a": 1, "b": 2}],
    storage_options={"timeout": "60s"},
)

import lancedb
async_db = await lancedb.connect_async("s3://bucket/path")
async_table = await async_db.create_table(
    "table",
    [{"a": 1, "b": 2}],
    storage_options={"timeout": "60s"},
)

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("s3://bucket/path");
const table = await db.createTable("table", [{ a: 1, b: 2 }], {
  storageOptions: { timeout: "60s" },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path");
const table = await db.createTable("table", [{ a: 1, b: 2 }], {
  storageOptions: { timeout: "60s" },
});

Storage option casing

The storage option keys are case-insensitive, so connect_timeout and CONNECT_TIMEOUT are the same setting. Usually lowercase is used in the storage_options argument and uppercase is used for environment variables. In the lancedb Node package, the keys can also be provided in camelCase capitalization. For example, connectTimeout is equivalent to connect_timeout.
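
As a quick illustration (the bucket path and the 10s value are placeholders), the following two Python connections configure the same setting, differing only in key casing:

import lancedb

# Lowercase key, as is conventional for storage_options
db1 = lancedb.connect("s3://bucket/path", storage_options={"connect_timeout": "10s"})

# Uppercase key; treated identically because keys are case-insensitive
db2 = lancedb.connect("s3://bucket/path", storage_options={"CONNECT_TIMEOUT": "10s"})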

General configuration

There are several options that can be set for all object stores, mostly related to network client configuration.

| Key | Description |
| --- | --- |
| allow_http | Allow non-TLS, i.e. non-HTTPS connections. Default: False. |
| allow_invalid_certificates | Skip certificate validation on HTTPS connections. Default: False. |
| connect_timeout | Timeout for only the connect phase of a client. Default: 5s. |
| timeout | Timeout for the entire request, from connection until the response body has finished. Default: 30s. |
| user_agent | User agent string to use in requests. |
| proxy_url | URL of a proxy server to use for requests. Default: None. |
| proxy_ca_certificate | PEM-formatted CA certificate for proxy connections. |
| proxy_excludes | List of hosts that bypass the proxy. This is a comma-separated list of domains and IP masks. Any subdomain of a provided domain is bypassed. For example, example.com, 192.168.1.0/24 would bypass https://api.example.com, https://www.example.com, and any IP in the range 192.168.1.0/24. |
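
These options can be combined freely in storage_options. As a sketch (the proxy URL, timeouts, and excluded host below are illustrative placeholders, not required values):

import lancedb

db = lancedb.connect(
    "s3://bucket/path",
    storage_options={
        "connect_timeout": "10s",   # fail fast if a connection cannot be established
        "timeout": "120s",          # allow longer-running requests, e.g. large uploads
        "proxy_url": "http://proxy.internal:8080",   # placeholder proxy server
        "proxy_excludes": "192.168.1.0/24",          # hosts that bypass the proxy
    },
)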

AWS S3

To configure credentials for AWS S3, you can use the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN keys. The region can also be set, but it is not mandatory when using AWS. These can be set as environment variables or passed in the storage_options parameter:

import lancedb
db = lancedb.connect(
    "s3://bucket/path",
    storage_options={
        "aws_access_key_id": "my-access-key",
        "aws_secret_access_key": "my-secret-key",
        "aws_session_token": "my-session-token",
    },
)

import lancedb
async_db = await lancedb.connect_async(
    "s3://bucket/path",
    storage_options={
        "aws_access_key_id": "my-access-key",
        "aws_secret_access_key": "my-secret-key",
        "aws_session_token": "my-session-token",
    },
)

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("s3://bucket/path", {
  storageOptions: {
    awsAccessKeyId: "my-access-key",
    awsSecretAccessKey: "my-secret-key",
    awsSessionToken: "my-session-token",
  },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path", {
  storageOptions: {
    awsAccessKeyId: "my-access-key",
    awsSecretAccessKey: "my-secret-key",
    awsSessionToken: "my-session-token",
  },
});

Alternatively, if you are using AWS SSO, you can use the AWS_PROFILE and AWS_DEFAULT_REGION environment variables.
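
For instance, these variables can be set in the process environment before connecting; a minimal Python sketch (the profile name is a hypothetical placeholder, not a LanceDB-specific value):

import os
import lancedb

# "my-sso-profile" is a placeholder; use the profile name from your AWS config.
os.environ["AWS_PROFILE"] = "my-sso-profile"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

db = lancedb.connect("s3://bucket/path")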

The following keys can be used either as environment variables or as keys in the storage_options parameter:

| Key | Description |
| --- | --- |
| aws_region / region | The AWS region the bucket is in. This can be detected automatically when using AWS S3, but must be specified for S3-compatible stores. |
| aws_access_key_id / access_key_id | The AWS access key ID to use. |
| aws_secret_access_key / secret_access_key | The AWS secret access key to use. |
| aws_session_token / session_token | The AWS session token to use. |
| aws_endpoint / endpoint | The endpoint to use for S3-compatible stores. |
| aws_virtual_hosted_style_request / virtual_hosted_style_request | Whether to use virtual hosted-style requests, where the bucket name is part of the endpoint. Meant to be used with aws_endpoint. Default: False. |
| aws_s3_express / s3_express | Whether to use S3 Express One Zone endpoints. Default: False. See more details below. |
| aws_server_side_encryption | The server-side encryption algorithm to use. Must be one of "AES256", "aws:kms", or "aws:kms:dsse". Default: None. |
| aws_sse_kms_key_id | The KMS key ID to use for server-side encryption. If set, aws_server_side_encryption must be "aws:kms" or "aws:kms:dsse". |
| aws_sse_bucket_key_enabled | Whether to use bucket keys for server-side encryption. |
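
For example, the encryption keys from this table can be combined to write with SSE-KMS; a sketch with placeholder values (the KMS key ID below is hypothetical):

import lancedb

db = lancedb.connect(
    "s3://bucket/path",
    storage_options={
        "aws_server_side_encryption": "aws:kms",
        # Placeholder key ID; omit it to use the default AWS-managed key.
        "aws_sse_kms_key_id": "1234abcd-12ab-34cd-56ef-1234567890ab",
    },
)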

Automatic cleanup for failed writes

LanceDB uses multipart uploads when writing data to S3 in order to maximize write speed. LanceDB aborts these uploads when it shuts down gracefully, such as when cancelled by a keyboard interrupt. However, in the rare case that LanceDB crashes, some data may be left lingering in your account. To clean up this data, we recommend (as AWS itself does) setting up a lifecycle rule that deletes in-progress uploads after 7 days. See the AWS guide:

Configuring a bucket lifecycle configuration to delete incomplete multipart uploads
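
If you prefer to set the rule up programmatically rather than in the console, a minimal boto3 sketch looks like the following (the bucket name and rule ID are placeholders):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)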

AWS IAM Permissions

If a bucket is private, then an IAM policy must be specified to allow access to it. For many development scenarios, broad permissions such as a PowerUser account are more than sufficient for working with LanceDB. However, in production scenarios you may wish to grant permissions that are as narrow as possible.

For read and write access, LanceDB will need a policy such as:

{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:PutObject","s3:GetObject","s3:DeleteObject"],"Resource":"arn:aws:s3:::<bucket>/<prefix>/*"},{"Effect":"Allow","Action":["s3:ListBucket","s3:GetBucketLocation"],"Resource":"arn:aws:s3:::<bucket>","Condition":{"StringLike":{"s3:prefix":["<prefix>/*"]}}}]}

For read-only access, LanceDB will need a policy such as:

{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:GetObject"],"Resource":"arn:aws:s3:::<bucket>/<prefix>/*"},{"Effect":"Allow","Action":["s3:ListBucket","s3:GetBucketLocation"],"Resource":"arn:aws:s3:::<bucket>","Condition":{"StringLike":{"s3:prefix":["<prefix>/*"]}}}]}

DynamoDB Commit Store for concurrent writes

By default, S3 does not support concurrent writes. Having two or more processes writing to the same table at the same time can lead to data corruption. This is because S3, unlike other object stores, does not have any atomic put or copy operation.

To enable concurrent writes, you can configure LanceDB to use a DynamoDB table as a commit store. This table will be used to coordinate writes between different processes. To enable this feature, you must modify your connection URI to use the s3+ddb scheme and add a query parameter ddbTableName with the name of the table to use.

import lancedb
db = lancedb.connect("s3+ddb://bucket/path?ddbTableName=my-dynamodb-table")

import lancedb
async_db = await lancedb.connect_async("s3+ddb://bucket/path?ddbTableName=my-dynamodb-table")

const lancedb = require("lancedb");
const db = await lancedb.connect("s3+ddb://bucket/path?ddbTableName=my-dynamodb-table");

The DynamoDB table must be created with the following schema:

  • Hash key: base_uri (string)
  • Range key: version (number)

You can create this programmatically with:

import boto3

dynamodb = boto3.client("dynamodb")
table = dynamodb.create_table(
    TableName=table_name,
    KeySchema=[
        {"AttributeName": "base_uri", "KeyType": "HASH"},
        {"AttributeName": "version", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "base_uri", "AttributeType": "S"},
        {"AttributeName": "version", "AttributeType": "N"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
)

import {
  CreateTableCommand,
  DynamoDBClient,
} from "@aws-sdk/client-dynamodb";

const dynamodb = new DynamoDBClient({
  region: CONFIG.awsRegion,
  credentials: {
    accessKeyId: CONFIG.awsAccessKeyId,
    secretAccessKey: CONFIG.awsSecretAccessKey,
  },
  endpoint: CONFIG.awsEndpoint,
});
const command = new CreateTableCommand({
  TableName: table_name,
  AttributeDefinitions: [
    { AttributeName: "base_uri", AttributeType: "S" },
    { AttributeName: "version", AttributeType: "N" },
  ],
  KeySchema: [
    { AttributeName: "base_uri", KeyType: "HASH" },
    { AttributeName: "version", KeyType: "RANGE" },
  ],
  ProvisionedThroughput: {
    ReadCapacityUnits: 1,
    WriteCapacityUnits: 1,
  },
});
await dynamodb.send(command);

S3-compatible stores

LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you must specify both region and endpoint:

import lancedb
db = lancedb.connect(
    "s3://bucket/path",
    storage_options={
        "region": "us-east-1",
        "endpoint": "http://minio:9000",
    },
)

import lancedb
async_db = await lancedb.connect_async(
    "s3://bucket/path",
    storage_options={
        "region": "us-east-1",
        "endpoint": "http://minio:9000",
    },
)

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("s3://bucket/path", {
  storageOptions: {
    region: "us-east-1",
    endpoint: "http://minio:9000",
  },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("s3://bucket/path", {
  storageOptions: {
    region: "us-east-1",
    endpoint: "http://minio:9000",
  },
});

This can also be done with the AWS_ENDPOINT and AWS_DEFAULT_REGION environment variables.

Local servers

For local development, the server often has an http endpoint rather than a secure https endpoint. In this case, you must also set the ALLOW_HTTP environment variable to true to allow non-TLS connections, or pass the storage option allow_http as true. If you do not do this, you will get an error like URL scheme is not allowed.
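
For example, when pointing at a local MinIO server over plain HTTP, allow_http can be passed alongside the other S3 options; a sketch with placeholder endpoint and region values:

import lancedb

db = lancedb.connect(
    "s3://bucket/path",
    storage_options={
        "allow_http": "true",                 # permit the non-TLS endpoint below
        "region": "us-east-1",
        "endpoint": "http://localhost:9000",  # placeholder local MinIO endpoint
    },
)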

S3 Express

LanceDB supports S3 Express One Zone endpoints, but requires additional infrastructure configuration for the compute service, such as EC2 or Lambda. Please refer to Networking requirements for S3 Express One Zone.

To configure LanceDB to use an S3 Express endpoint, you must set the storage option s3_express. The bucket name in your table URI should include the S3 Express zone suffix, as shown below.

import lancedb
db = lancedb.connect(
    "s3://my-bucket--use1-az4--x-s3/path",
    storage_options={
        "region": "us-east-1",
        "s3_express": "true",
    },
)

import lancedb
async_db = await lancedb.connect_async(
    "s3://my-bucket--use1-az4--x-s3/path",
    storage_options={
        "region": "us-east-1",
        "s3_express": "true",
    },
)

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("s3://my-bucket--use1-az4--x-s3/path", {
  storageOptions: {
    region: "us-east-1",
    s3Express: "true",
  },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("s3://my-bucket--use1-az4--x-s3/path", {
  storageOptions: {
    region: "us-east-1",
    s3Express: "true",
  },
});

Google Cloud Storage

GCS credentials are configured by setting the GOOGLE_SERVICE_ACCOUNT environment variable to the path of a JSON file containing the service account credentials. Alternatively, you can pass the path to the JSON file in the storage_options:

import lancedb
db = lancedb.connect(
    "gs://my-bucket/my-database",
    storage_options={
        "service_account": "path/to/service-account.json",
    },
)

import lancedb
async_db = await lancedb.connect_async(
    "gs://my-bucket/my-database",
    storage_options={
        "service_account": "path/to/service-account.json",
    },
)

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("gs://my-bucket/my-database", {
  storageOptions: {
    serviceAccount: "path/to/service-account.json",
  },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("gs://my-bucket/my-database", {
  storageOptions: {
    serviceAccount: "path/to/service-account.json",
  },
});

HTTP/2 support

By default, GCS uses HTTP/1 for communication rather than HTTP/2, which significantly improves maximum throughput. However, if you wish to use HTTP/2 for some reason, you can set the environment variable HTTP1_ONLY to false.

The following keys can be used either as environment variables or as keys in the storage_options parameter:

| Key | Description |
| --- | --- |
| google_service_account / service_account | Path to the service account JSON file. |
| google_service_account_key | The serialized service account key. |
| google_application_credentials | Path to the application credentials. |

Azure Blob Storage

Azure Blob Storage credentials can be configured by setting the AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY environment variables. Alternatively, you can pass the account name and key in the storage_options parameter:

import lancedb
db = lancedb.connect(
    "az://my-container/my-database",
    storage_options={
        "account_name": "some-account",
        "account_key": "some-key",
    },
)

import lancedb
async_db = await lancedb.connect_async(
    "az://my-container/my-database",
    storage_options={
        "account_name": "some-account",
        "account_key": "some-key",
    },
)

import * as lancedb from "@lancedb/lancedb";
const db = await lancedb.connect("az://my-container/my-database", {
  storageOptions: {
    accountName: "some-account",
    accountKey: "some-key",
  },
});

const lancedb = require("lancedb");
const db = await lancedb.connect("az://my-container/my-database", {
  storageOptions: {
    accountName: "some-account",
    accountKey: "some-key",
  },
});

These keys can be used either as environment variables or as keys in the storage_options parameter:

| Key | Description |
| --- | --- |
| azure_storage_account_name | The name of the Azure storage account. |
| azure_storage_account_key | The storage account access key. |
| azure_client_id | Service principal client ID for authorizing requests. |
| azure_client_secret | Service principal client secret for authorizing requests. |
| azure_tenant_id | Tenant ID used in OAuth flows. |
| azure_storage_sas_key | Shared access signature. The signature is expected to be percent-encoded, as provided in the Azure Storage Explorer or Azure portal. |
| azure_storage_token | Bearer token. |
| azure_storage_use_emulator | Use the object store with the Azurite storage emulator. |
| azure_endpoint | Override the endpoint used to communicate with blob storage. |
| azure_use_fabric_endpoint | Use the object store with the URL scheme account.dfs.fabric.microsoft.com. |
| azure_msi_endpoint | Endpoint to request an IMDS managed identity token. |
| azure_object_id | Object ID for use with managed identity authentication. |
| azure_msi_resource_id | MSI resource ID for use with managed identity authentication. |
| azure_federated_token_file | File containing a token for Azure AD workload identity federation. |
| azure_use_azure_cli | Use the Azure CLI to acquire an access token. |
| azure_disable_tagging | Disables tagging objects. This can be desirable if tagging is not supported by the backing store. |
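
For example, to authenticate with a service principal instead of an account key, the client ID, client secret, and tenant ID from the table above can be passed together; a sketch in which all credential values are placeholders:

import lancedb

db = lancedb.connect(
    "az://my-container/my-database",
    storage_options={
        "account_name": "some-account",
        # Placeholder service principal credentials
        "azure_client_id": "00000000-0000-0000-0000-000000000000",
        "azure_client_secret": "my-client-secret",
        "azure_tenant_id": "11111111-1111-1111-1111-111111111111",
    },
)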
