Manage vector indexes
This document describes how to create and manage vector indexes toaccelerate your vector searches.
A vector index is a data structure designed to let theVECTOR_SEARCH functionandAI.SEARCH functionexecute more efficiently, especially on large datasets.When using an index, these search functions useApproximate Nearest Neighbor (ANN)algorithms to reduce query latency and computational cost. While ANN introducesa degree of approximation, meaning thatrecallmight not be 100%, the performance improvements typically offer an advantagefor most applications.
Roles and permissions
To create a vector index, you need thebigquery.tables.createIndex IAM permissionon the table where you're creating the index. To drop a vector index, you needthebigquery.tables.deleteIndex permission. Each of the following predefinedIAM roles includes the permissions that you need to work withvector indexes:
- BigQuery Data Owner (
roles/bigquery.dataOwner) - BigQuery Data Editor (
roles/bigquery.dataEditor)
Choose a vector index type
BigQuery offers two vector index types,IVF andTreeAH, each supportingdifferent use cases. BigQuery supports batching for vectorsearch by processing multiple rows of the input data in theVECTOR_SEARCH.For small query batches, IVF indexes are preferred. For large query batches,TreeAH indexes, which are built with Google'sScaNN algorithm,are preferred.
IVF Index
IVF is an inverted file index, which uses a k-means algorithm to cluster thevector data, and then partitions the vector data based on those clusters. TheVECTOR_SEARCH andAI.SEARCH functionscan use these partitions to reduce the amount of data it needs to read in orderto determine a result.
TreeAH Index
The TreeAH index type is named for its combination of a tree-like structureand its use of Asymmetric Hashing (AH), a core quantization technique from theunderlyingScaNN algorithm.A TreeAH index works as follows:
- The base table is divided into smaller, more manageable shards.
- A clustering model is trained, with the number of clusters derived fromthe
leaf_node_embedding_countoption in thetree_ah_optionsargument of theCREATE VECTOR INDEXstatement. - The vectors are compressed with product quantization, a technique thatreduces their memory usage. The compressed vectors are then stored in theindex tables instead of the original vectors, thus reducing vector indexsizes.
- When the
VECTOR_SEARCHorAI.SEARCHfunction runs, a candidate list foreach queryvector is efficiently computed using asymmetric hashing, which ishardware-optimized for approximate distance calculations. These candidatesare then re-scored and re-ranked using exact embeddings.
The TreeAH algorithm is optimized for batch queries that process hundreds ormore query vectors. The use of product quantization can significantly reducelatency and cost, potentially by orders of magnitude compared to IVF. However,due to increased overhead, the IVF algorithm might be better when you have asmaller number of query vectors.
We suggest you try the TreeAH index type if your use case meets the followingcriteria:
Your table contains 200 million rows or fewer.
You frequently execute large batch queries involving hundreds or more queryvectors.
For small batch queries with the TreeAH index type,VECTOR_SEARCH orAI.SEARCH mightrevert tobrute-force search.When this occurs, aIndexUnusedReasonis provided to explain why the vector index was not utilized.
Create an IVF vector index
To create an IVF vector index, use theCREATE VECTOR INDEXdata definition language (DDL) statement:
Go to theBigQuery page.
In the query editor, run the following SQL statement:
To create aIVF vector index:
CREATE[ORREPLACE]VECTORINDEX[IFNOTEXISTS]INDEX_NAMEONDATASET_NAME.TABLE_NAME(COLUMN_NAME)STORING(STORED_COLUMN_NAME[,...])OPTIONS(index_type='IVF',distance_type='DISTANCE_TYPE',ivf_options='{"num_lists":NUM_LISTS}')
Replace the following:
INDEX_NAME: the name of the vector index you'recreating. Since the index is always created in the same project anddataset as the base table, there is no need to specify these in the name.DATASET_NAME: the name of the dataset thatcontains the table.TABLE_NAME: the name of the table that containsthe column with embeddings data.COLUMN_NAME: the name of a column that containsthe embeddings data. The column must have a type ofARRAY<FLOAT64>, orif you are usingautonomous embedding generation,a type ofSTRUCT<result ARRAY<FLOAT64>, status STRING>.In all cases, all elements in theembedding array must be non-
NULL, and all values in the column musthave the same array dimensions.If the column type is
STRUCT<result ARRAY<FLOAT64>, status STRING>, thentheSTRUCTvalue can beNULLor theresultarray can beNULL. Anyrows withNULLfor these values are ignored.STORED_COLUMN_NAME: the name of a top-level columnin the table to store in the vector index. The column type can't beRANGE. Stored columns are not used if the table has a row-level accesspolicy or the column has a policy tag. For information about how toenable stored columns, seeStore columns and pre-filter.DISTANCE_TYPE: specifies the default distance typeto use when performing a vector search using this index. The supportedvalues areEUCLIDEAN,COSINE,andDOT_PRODUCT.EUCLIDEANis the default.The index creation itself always uses
EUCLIDEANdistance for trainingbut the distance used in your search function can be different.If you specify a value for the
distance_typeargument of theVECTOR_SEARCHorAI.SEARCHfunction, that value is used instead oftheDISTANCE_TYPEvalue.NUM_LISTS: anINT64value that specifies thenumber of lists that the IVF index clusters and then partitions yourvector data into. This value must be 5,000 or less. During indexing,vectors are assigned to the list corresponding to their nearest clustercentroid. If you omit this argument, BigQuery determines adefault value based on your data characteristics. The default value workswell for most use cases.NUM_LISTScontrols query tuning granularity.Higher values create more lists, so you can set thefraction_lists_to_searchoption of your search function to scana smaller percentage of the index. For example, scanning 1% of 100 listsas opposed to scanning 10% of 10 lists. This enables finer control of thesearch speed and recall but slightly increases the indexing cost. Set thisargument value based on how precisely you need to tune query scope.
The following example creates a vector index on theembedding columnofmy_table:
CREATETABLEmy_dataset.my_table(embeddingARRAY<FLOAT64>);CREATEVECTORINDEXmy_indexONmy_dataset.my_table(embedding)OPTIONS(index_type='IVF');
The following example creates a vector index on theembedding columnofmy_table, and specifies the distance type to use and the IVF options:
CREATETABLEmy_dataset.my_table(embeddingARRAY<FLOAT64>);CREATEVECTORINDEXmy_indexONmy_dataset.my_table(embedding)OPTIONS(index_type='IVF',distance_type='COSINE',ivf_options='{"num_lists": 2500}')
The following example creates a table withautonomous embedding generationenabled and creates a vector index on the table. Thedescription_embeddingembedding column is automatically generated based on thedescription column.
CREATETABLEmydataset.products(descriptionSTRING,description_embeddingSTRUCT<resultARRAY<FLOAT64>,statusSTRING>GENERATEDALWAYSAS(AI.EMBED(description,connection_id=>'us.example_connection',endpoint=>'text-embedding-005'))STOREDOPTIONS(asynchronous=TRUE));CREATEVECTORINDEXmy_indexONmy_dataset.my_table(description_embedding)OPTIONS(index_type='IVF');
Create a TreeAH vector index
To create a TreeAH vector index, use theCREATE VECTOR INDEXdata definition language (DDL) statement:
Go to theBigQuery page.
In the query editor, run the following SQL statement:
CREATE[ORREPLACE]VECTORINDEX[IFNOTEXISTS]INDEX_NAMEONDATASET_NAME.TABLE_NAME(COLUMN_NAME)STORING(STORED_COLUMN_NAME[,...])OPTIONS(index_type='TREE_AH',distance_type='DISTANCE_TYPE',tree_ah_options='{"leaf_node_embedding_count":LEAF_NODE_EMBEDDING_COUNT, "normalization_type":"NORMALIZATION_TYPE"}')
Replace the following:
INDEX_NAME: the name of the vector index thatyou are creating. Since the index is always created in the same projectand dataset as the base table, there is no need to specify these inthe name.DATASET_NAME: the name of the dataset thatcontains the table.TABLE_NAME: the name of the table that containsthe column with embeddings data.COLUMN_NAME: the name of a column that containsthe embeddings data. The column must have a type ofARRAY<FLOAT64>, orif you are usingautonomous embedding generation,a type ofSTRUCT<result ARRAY<FLOAT64>, status STRING>.In all cases, all elements in theembedding array must be non-
NULL, and all values in the column musthave the same array dimensions. The array dimension must be at least 2.If the column type is
STRUCT<result ARRAY<FLOAT64>, status STRING>, thentheSTRUCTvalue can beNULLor theresultarray can beNULL. Anyrows withNULLfor these values are ignored.STORED_COLUMN_NAME: the name of a top-level columnin the table to store in the vector index. The column type can't beRANGE. Stored columns are not used if the table has a row-level accesspolicy or the column has a policy tag. For information about how toenable stored columns, seeStore columns and pre-filter.DISTANCE_TYPE: an optional argument thatspecifies the default distance typeto use when performing a vector search using this index. The supportedvalues areEUCLIDEAN,COSINE,andDOT_PRODUCT.EUCLIDEANis the default.The index creation itself always uses
EUCLIDEANdistance for trainingbut the distance used in the search function can be different.If you specify a value for the
distance_typeargument of theVECTOR_SEARCHorAI.SEARCHfunction, that value is used instead oftheDISTANCE_TYPEvalue.LEAF_NODE_EMBEDDING_COUNT: anINT64valuegreater than or equal to 500 that specifies the approximate number ofvectors in each leaf node of the tree that the TreeAH algorithm creates.The TreeAH algorithm divides the whole data space into a number of lists,with each list containing approximatelyLEAF_NODE_EMBEDDING_COUNTdata points. A lower value creates more lists with fewer data points, while alarger value creates fewer lists with more data points. The default is1,000, which is appropriate for most datasets.NORMALIZATION_TYPE: aSTRINGvalue. Thesupported values areNONEorL2.The default isNONE. Normalization happens before any processing, forboth the base table data and the query data, but doesn't modify theembedding columnCOLUMN_NAMEinTABLE_NAME. Depending on the dataset, theembedding model, and the distance type used duringsearch, normalizing the embeddings might improve recall.
The following example creates a vector index on theembedding columnofmy_table, and specifies the distance type to use and the TreeAH options:
CREATETABLEmy_dataset.my_table(idINT64,embeddingARRAY<FLOAT64>);CREATEVECTORINDEXmy_indexONmy_dataset.my_table(embedding)OPTIONS(index_type='TREE_AH',distance_type='EUCLIDEAN',tree_ah_options='{"normalization_type": "L2"}');
Filtering
The following sections explain how pre-filters and post-filters affectvector search results, and also how to pre-filter by using stored columnsand partitions in the vector index.
Pre-filters and post-filters
In BigQueryVECTOR_SEARCH orAI.SEARCH calls,both pre-filtering andpost-filtering serve to refine search results, by applying conditions based onmetadata columns associated with the vector embeddings. It is important tounderstand their differences, implementation, and impact in order to optimizequery performance, cost, and accuracy.
Pre-filtering and post-filtering are defined as follows:
- Pre-filtering: Applies filter conditions before the approximate nearestneighbor (ANN) search performs distance calculations on candidate vectors.This narrows the pool of vectors that are considered during the search.Consequently, pre-filtering often results in faster query times and reducedcomputational cost, as the ANN search evaluates fewer potential candidates.
- Post-filtering: Applies filter conditions after the initial
top_knearest neighbors have been identified by the ANN search. Thisrefines the final result set based on the specified criteria.
The placement of yourWHERE clause determines whether a filter acts as apre-filter or a post-filter.
To create a pre-filter, theWHERE clause of the query must apply to thebase table argument of the search function.The predicate must apply to a stored column, otherwise it effectively becomesa post-filter.
The following example shows how to create a pre-filter:
-- Pre-filter on a stored column. The index speeds up the query.SELECT*FROMVECTOR_SEARCH((SELECT*FROMmy_dataset.my_tableWHEREtype='animal'),'embedding',TABLEmy_dataset.my_testdata);SELECT*FROMAI.SEARCH((SELECT*FROMmy_dataset.my_tableWHEREtype='animal'),'content','dog');-- Filter on a column that isn't stored. The index is used to search the-- entire table, and then the results are post-filtered. You might see fewer-- than 5 matches returned for some embeddings.SELECTquery.test_id,base.type,distanceFROMVECTOR_SEARCH((SELECT*FROMmy_dataset.my_tableWHEREid=123),'embedding',TABLEmy_dataset.my_testdata,top_k=>5);-- Use pre-filters with brute force. The data is filtered and then searched-- with brute force for exact results.SELECTquery.test_id,base.type,distanceFROMVECTOR_SEARCH((SELECT*FROMmy_dataset.my_tableWHEREid=123),'embedding',TABLEmy_dataset.my_testdata,options=>'{"use_brute_force":true}');
To create a post-filter, theWHERE clause of the query must be applied outsideof theVECTOR_SEARCH function, so that it filters the results returned bythe search.
The following example shows how to create a post-filter:
-- Use post-filters. The index is used, but the entire table is searched and-- the post-filtering might reduce the number of results.SELECTquery.test_id,base.type,distanceFROMVECTOR_SEARCH(TABLEmy_dataset.my_table,'embedding',TABLEmy_dataset.my_testdata,top_k=>5)WHEREbase.type='animal';SELECTbase.id,distanceFROMVECTOR_SEARCH(TABLEmydataset.base_table,'embedding',(SELECTembeddingFROMmydataset.query_table),top_k=>10)WHEREtype='document'ANDyear>2022
When you use post-filtering, or when the base table filters you specifyreference non-stored columns and thus act as post-filters, the final resultset might contain fewer thantop_k rows, potentially even zero rows,if the predicate is selective. If you require a specific number of resultsafter filtering, consider specifying a largertop_k value or increasing thefraction_lists_to_search value in the search function call.
In some cases, especially if the pre-filter is very selective, pre-filteringcan also reduce the size of the result set. If this happens, try increasing thefraction_lists_to_search value in the search function call.
Pre-filter with stored columns
To further improve the efficiency of your vector index, you can specify columnsfrom your base table to store in your vector index. Using stored columns canoptimize queries that call theVECTOR_SEARCH orAI.SEARCHfunctions in the following ways:
Instead of searching an entire table, you can call a searchfunction on a query statement thatpre-filters the base table with a
WHEREclause. If your table has an index and you filter on only storedcolumns, then BigQuery optimizes the query by filtering thedata before searching, and then using the index to search the smaller resultset. If you filter on columns that aren't stored, thenBigQuery applies the filter after the table is searched, orpost-filters.The
VECTOR_SEARCHandAI.SEARCHfunctions outputs a struct calledbasethat containsall columns from the base table. Without stored columns, a potentiallyexpensive join is needed to retrieve the columns stored inbase. Ifyou use an IVF index and your query only selects stored columnsfrombase, then BigQuery optimizes your query to eliminatethat join. For TreeAH indexes, the join with the base table is not removed.Stored columns in TreeAH indexes are only used for pre-filtering purposes.
To store columns, list them in theSTORING clause of theCREATE VECTOR INDEX statement.Storing columns increases the size of the vector index, so it's bestto store only the most frequently used or filtered columns.
The following example creates a vector index with stored columns, and then runsa vector search query that only selects stored columns:
-- Create a table that contains an embedding.CREATETABLEmy_dataset.my_table(embeddingARRAY<FLOAT64>,typeSTRING,creation_timeDATETIME,idINT64);-- Create a query table that contains an embedding.CREATETABLEmy_dataset.my_testdata(embeddingARRAY<FLOAT64>,test_idINT64);-- Create a vector index with stored columns.CREATEVECTORINDEXmy_indexONmy_dataset.my_table(embedding)STORING(type,creation_time)OPTIONS(index_type='IVF');-- Select only stored columns from a vector search to avoid an expensive join.SELECTquery,base.type,distanceFROMVECTOR_SEARCH(TABLEmy_dataset.my_table,'embedding'TABLEmy_dataset.my_testdata);
Stored column limitations
- If the mode, type, or schema of a column is changed in the base table, andif it is a stored column in the vector index, then there can be a delaybefore thatchange is reflected in the vector index. Until the updates have been appliedto the index, the vector search queries use the modified stored columns fromthe base table.
- If you select a column of type
STRUCTfrom thequeryoutput of asearch query on a table that has an index with stored columns, thenthe whole query might fail.
Pre-filter with partitions
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of theService Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see thelaunch stage descriptions.
Note: To give feedback or request support for this feature, contactbq-vector-search@google.comIf the table that you are creating the vector index on is partitioned, you canchoose to also partition the vector index. Partitioning the vector index hasthe following benefits:
- Partition pruning is applied to the vector indexes in addition to thetable partitions. Partition pruning occurs when the vector search uses aqualifying filter on the value of the partitioning column. This allowsBigQuery to scan the partitions that match the filter and skipthe remaining partitions. Partition pruning can decrease I/O costs.For more information on partition pruning, seeQuery partitioned tables.
- The vector search is less likely to miss results if you pre-filter on thepartitioning column.
You can only partition TreeAH vector indexes. You can't create a partitionedvector index on anautomatically generated embedding column.
Partitioning a vector index is only recommended if you use pre-filtering tolimit most of your vector searches to a few partitions.
To create a partitioned index, use thePARTITION BY clause of theCREATE VECTOR INDEX statement. ThePARTITION BY clause that you specify in theCREATE VECTOR INDEX statementmust be the same as thePARTITION BY clause specified in theCREATE TABLE statementof the table that you are creating the vector index on, as shown in thefollowing example:
-- Create a date-partitioned table.CREATETABLEmy_dataset.my_table(embeddingsARRAYidINT64,dateDATE,)PARTITIONBYdate;-- Create a partitioned vector index on the table.CREATEVECTORINDEXmy_indexONmy_dataset.my_table(embeddings)PARTITIONBYdateOPTIONS(index_type='TREE_AH',distance_type='COSINE');
If the table uses integer range or time-unit column partitioning, thepartitioning column is stored in the vector index, which increases storage cost.If a table column is used in both theSTORING andPARTITION BY clauses oftheCREATE VECTOR INDEX statement, the column is stored only once.
To use the vector index partition, filter on the partitioning column in thebase table subquery of theVECTOR_SEARCH orAI.SEARCH call.In the following example,thesamples.items table is partitioned by theproduced_date column,so the base table subquery in theVECTOR_SEARCH statement filters on theproduced_date column:
SELECTquery.id,base.id,distanceFROMVECTOR_SEARCH((SELECT*FROMmy_dataset.my_tableWHEREproduced_date='2025-01-01'),'embedding',TABLEsamples.test,distance_type=>'COSINE',top_k=>10);
Examples
Create a partitioned vector index on a datetime-partitioned table:
-- Create a datetime-partitioned table.CREATETABLEmy_dataset.my_table(idINT64,produced_dateDATETIME,embeddingsARRAY)PARTITIONBYproduced_date;-- Create a partitioned vector index on the table.CREATEVECTORINDEXindex0ONmy_dataset.my_table(embeddings)PARTITIONBYproduced_dateOPTIONS(index_type='TREE_AH',distance_type='COSINE');
Create a partitioned vector index on a timestamp-partitioned table:
-- Create a timestamp-partitioned table.CREATETABLEmy_dataset.my_table(idINT64,produced_timeTIMESTAMP,embeddingsARRAY)PARTITIONBYTIMESTAMP_TRUNC(produced_time,HOUR);-- Create a partitioned vector index on the table.CREATEVECTORINDEXindex0ONmy_dataset.my_table(embeddings)PARTITIONBYTIMESTAMP_TRUNC(produced_time,HOUR)OPTIONS(index_type='TREE_AH',distance_type='COSINE');
Create a partitioned vector index on an integer range-partitioned table:
-- Create a integer range-partitioned table.CREATETABLEmy_dataset.my_table(idINT64,embeddingsARRAY)PARTITIONBYRANGE_BUCKET(id,GENERATE_ARRAY(-100,100,20));-- Create a partitioned vector index on the table.CREATEVECTORINDEXindex0ONmy_dataset.my_table(embeddings)PARTITIONBYRANGE_BUCKET(id,GENERATE_ARRAY(-100,100,20))OPTIONS(index_type='TREE_AH',distance_type='COSINE');
Create a partitioned vector index on an ingestion time-partitioned table:
-- Create a ingestion time-partitioned table.CREATETABLEmy_dataset.my_table(idINT64,embeddingsARRAY)PARTITIONBYTIMESTAMP_TRUNC(_PARTITIONTIME,DAY);-- Create a partitioned vector index on the table.CREATEVECTORINDEXindex0ONmy_dataset.my_table(embeddings)PARTITIONBYTIMESTAMP_TRUNC(_PARTITIONTIME,DAY)OPTIONS(index_type='TREE_AH',distance_type='COSINE');
Pre-filtering limitations
- You can't uselogical views in your pre-filter.
- If your pre-filter contains asubquery, it mightinterfere with index usage.
Understanding when data is indexed
Vector indexes are fully managed by BigQuery and areautomatically refreshed when the indexed table changes.
Indexing is asynchronous. There is a delay between adding new rows to thebase table and the new rows being reflected in the index. However, theVECTOR_SEARCH andAI.SEARCH functions still takes all rows into account anddon't missunindexed rows. The functions search using the index for indexed records, anduse brute force search for the records that aren't yet indexed.
If you create a vector index on anautomatically generated embedding column,then index training starts as soon as at least 80% of the rowshave generated embeddings.
If you create a vector index on a table that is smaller than 10 MB, then thevector index isn't populated. Similarly, if you delete data from an indexedtable and the table size falls below 10 MB, then the vector index is temporarilydisabled. In this case, vector search queries don't use the index and theindexUnusedReasons code in thevectorSearchStatisticssection of theJob resource isBASE_TABLE_TOO_SMALL. Without the index,your search function automatically falls back to using brute force to find thenearest neighbors of embeddings.
If you delete the indexed column in a table, or rename the table itself, thevector index is automatically deleted.
Monitoring the status of vector indexes
You can monitor the health of your vector indexes by queryingINFORMATION_SCHEMA views. The following views contain metadata on vectorindexes:
The
INFORMATION_SCHEMA.VECTOR_INDEXESviewhas information about the vector indexes in a dataset.After the
CREATE VECTOR INDEXstatement completes, the index must stillbe populated before you can use it. You can use thelast_refresh_timeandcoverage_percentagecolumns to verify the readiness of a vectorindex. If the vector index isn't ready, you can still use theVECTOR_SEARCHandAI.SEARCHfunctions on a table, they just mightrun more slowly without the index.The
INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNSviewhas information about the vector-indexed columns for all tables in a dataset.The
INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONSviewhas information about the options used by the vector indexes in a dataset.
Vector index examples
The following example shows all active vector indexes on tables in the datasetmy_dataset, located in the projectmy_project. It includes their names, theDDL statements used to create them, and their coverage percentage. If anindexed base table is less than 10 MB, then its index is not populated, inwhich case thecoverage_percentage value is 0.
SELECTtable_name,index_name,ddl,coverage_percentageFROMmy_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXESWHEREindex_status='ACTIVE';
The result is similar to the following:
+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+| table_name | index_name | ddl | coverage_percentage |+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+| table1 | indexa | CREATE VECTOR INDEX `indexa` ON `my_project.my_dataset.table1`(embeddings) | 100 || | | OPTIONS (distance_type = 'EUCLIDEAN', index_type = 'IVF', ivf_options = '{"num_lists": 100}') | |+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+| table2 | indexb | CREATE VECTOR INDEX `indexb` ON `my_project.my_dataset.table2`(vectors) | 42 || | | OPTIONS (distance_type = 'COSINE', index_type = 'IVF', ivf_options = '{"num_lists": 500}') | |+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+| table3 | indexc | CREATE VECTOR INDEX `indexc` ON `my_project.my_dataset.table3`(vectors) | 98 || | | OPTIONS (distance_type = 'DOT_PRODUCT', index_type = 'TREE_AH', | || | | tree_ah_options = '{"leaf_node_embedding_count": 1000, "normalization_type": "NONE"}') | |+------------+------------+-------------------------------------------------------------------------------------------------+---------------------+Vector index columns examples
The following query extracts information on columns that have vector indexes:
SELECTtable_name,index_name,index_column_name,index_field_pathFROMmy_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_COLUMNS;
The result is similar to the following:
+------------+------------+-------------------+------------------+| table_name | index_name | index_column_name | index_field_path |+------------+------------+-------------------+------------------+| table1 | indexa | embeddings | embeddings || table2 | indexb | vectors | vectors || table3 | indexc | vectors | vectors |+------------+------------+-------------------+------------------+
Vector index options examples
The following query extracts information on vector index options:
SELECTtable_name,index_name,option_name,option_type,option_valueFROMmy_project.dataset.INFORMATION_SCHEMA.VECTOR_INDEX_OPTIONS;
The result is similar to the following:
+------------+------------+------------------+------------------+-------------------------------------------------------------------+| table_name | index_name | option_name | option_type | option_value |+------------+------------+------------------+------------------+-------------------------------------------------------------------+| table1 | indexa | index_type | STRING | IVF || table1 | indexa | distance_type | STRING | EUCLIDEAN || table1 | indexa | ivf_options | STRING | {"num_lists": 100} || table2 | indexb | index_type | STRING | IVF || table2 | indexb | distance_type | STRING | COSINE || table2 | indexb | ivf_options | STRING | {"num_lists": 500} || table3 | indexc | index_type | STRING | TREE_AH || table3 | indexc | distance_type | STRING | DOT_PRODUCT || table3 | indexc | tree_ah_options | STRING | {"leaf_node_embedding_count": 1000, "normalization_type": "NONE"} |+------------+------------+------------------+------------------+-------------------------------------------------------------------+Verifying vector index usage
Information on vector index usage is available in the job metadata of thejob that ran the vector search query. You canview job metadata by usingthe Google Cloud console, the bq command-line tool, the BigQuery API, or theclient libraries.
When you use the Google Cloud console, you can find vector index usageinformation in theVector Index Usage Mode andVector Index Unused Reasons fields.
When you use the bq tool or the BigQuery API, you canfind vector index usage information in theVectorSearchStatisticssection of theJob resource.
The index usage mode indicates whether a vector index was used by providingone of the following values:
UNUSED: No vector index was used.PARTIALLY_USED: Some search functions in the query used vectorindexes and some didn't.FULLY_USED: Every search function in the query used a vector index.
When the index usage mode value isUNUSED orPARTIALLY_USED,the index unused reasons indicate why vector indexes weren't used in the query.
For example, the following results returned bybq show --format=prettyjson -j my_job_id shows that the index was not usedbecause theuse_brute_force option was specified in theVECTOR_SEARCHfunction:
"vectorSearchStatistics": { "indexUnusedReasons": [ { "baseTable": { "datasetId": "my_dataset", "projectId": "my_project", "tableId": "my_table" }, "code": "INDEX_SUPPRESSED_BY_FUNCTION_OPTION", "message": "No vector index was used for the base table `my_project:my_dataset.my_table` because use_brute_force option has been specified." } ], "indexUsageMode": "UNUSED"}Index management options
To create indexes and have BigQuery maintain them,you have two options:
- Use the default shared slot pool: When the data you planto index isbelow your per-organizationlimit, youcan use the free shared slot pool for index management.
- Use your own reservation:To achieve more predictable and consistent indexingprogress on your larger production workloads, you can use your ownreservations for index management.
Use shared slots
If you have not configured your project to use adedicated reservation for indexing,index management is handled in the free, shared slot pool, subject to thefollowing constraints.
If you add data to a table which causes the total size of indexedtables to exceed your organization'slimit,BigQuery pauses index managementfor that table. When this happens, theindex_status field in theINFORMATION_SCHEMA.VECTOR_INDEXES viewdisplaysPENDING DISABLEMENT and the index is queued for deletion. Whilethe index is pending disablement, it isstill used in queries and you are charged for the index storage.After the index is deleted, theindex_status field showsthe index asTEMPORARILY DISABLED. In this state, queries don't use the index,and you are not charged for index storage. In this case, theIndexUnusedReason codeisBASE_TABLE_TOO_LARGE.
If you delete data from the table and the total size of indexed tablesfalls below the per-organization limit, then index management is resumed. Theindex_status field in theINFORMATION_SCHEMA.VECTOR_INDEXESview isACTIVE, queries can use the index, and you are charged for theindex storage.
You can use theINFORMATION_SCHEMA.SEARCH_INDEXES_BY_ORGANIZATION viewto understand your current consumption towards the per-organization limit in agiven region, broken down by projects and tables.
BigQuery does not make guarantees about the availablecapacity of the shared pool or the throughput of indexing you see.For production applications, you might want to usededicated slots for your index processing.
Use your own reservation
Instead of using the default shared slot pool, you can optionally designate yourown reservation to index your tables. Using your own reservation ensurespredictable and consistent performance of index-management jobs, such ascreation, refresh, and background optimizations.
- There are no table size limits when an indexing job runs in yourreservation.
- Using your own reservation gives you flexibility in your index management.If you need to create a very large index ormake a major update to an indexed table, you can temporarily add moreslots to the assignment.
To index the tables in a project with a designated reservation,create a reservationin the region where your tables are located. Then, assign the project to thereservation with thejob_type set toBACKGROUND, which shares resourcesacross background optimization jobs:
SQL
Use theCREATE ASSIGNMENT DDL statement.
In the Google Cloud console, go to theBigQuery page.
In the query editor, enter the following statement:
CREATEASSIGNMENT`ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID`OPTIONS(assignee='projects/PROJECT_ID',job_type='BACKGROUND');
Replace the following:
ADMIN_PROJECT_ID: the project ID of theadministration project that owns the reservation resourceLOCATION: thelocation of the reservationRESERVATION_NAME: the name of the reservationASSIGNMENT_ID: the ID of the assignmentThe ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes.
PROJECT_ID: the ID of the project containing the tables to index. This project is assigned to the reservation.
ClickRun.
For more information about how to run queries, seeRun an interactive query.
bq
Use thebq mk command:
bq mk \ --project_id=ADMIN_PROJECT_ID \ --location=LOCATION \ --reservation_assignment \ --reservation_id=RESERVATION_NAME \ --assignee_id=PROJECT_ID \ --job_type=BACKGROUND \ --assignee_type=PROJECT
Replace the following:
ADMIN_PROJECT_ID: the project ID of theadministration projectthat owns the reservation resourceLOCATION: thelocation of the reservationRESERVATION_NAME: the name of thereservationPROJECT_ID: the ID of the project to assignto this reservation
View your indexing jobs
A new indexing job is created every time an index is created or updated ona single table. To view information about the job, query theINFORMATION_SCHEMA.JOBS* views. Youcan filter for indexing jobs bysettingjob_type IS NULL AND SEARCH(job_id, '`search_index`') in theWHEREclause of your query. The following example lists the five most recent indexingjobs in the projectmy_project:
SELECT*FROMregion-us.INFORMATION_SCHEMA.JOBSWHEREproject_id='my_project'ANDjob_typeISNULLANDSEARCH(job_id,'`search_index`')ORDERBYcreation_timeDESCLIMIT5;
Choose your reservation size
To choose the right number of slots for your reservation, you should considerwhen index-management jobs are run, how many slots they use, and what your usagelooks like over time. BigQuery triggers an index-management jobin the following situations:
- You create an index on a table.
- Data is modified in an indexed table.
- The schema of a table changes and this affects which columns are indexed.
- Index data and metadata are periodically optimized or updated.
The number of slots you need for an index-management job on a table depends onthe following factors:
- The size of the table
- The rate of data ingestion to the table
- The rate of DML statements applied to the table
- The acceptable delay for building and maintaining the index
- The complexity of the index, typically determined by attributes of the data,such as the number of duplicate terms
Monitor Usage and Progress
The best way to assess the number of slots you need to efficiently run yourindex-management jobs is to monitor your slot utilization and adjust thereservation size accordingly. The following query produces the daily slot usagefor index-management jobs. Only the past 30 days are included in theregionus-west1:
SELECTTIMESTAMP_TRUNC(job.creation_time,DAY)ASusage_date,-- Aggregate total_slots_ms used for index-management jobs in a day and divide-- by the number of milliseconds in a day. This value is most accurate for-- days with consistent slot usage.SAFE_DIVIDE(SUM(job.total_slot_ms),(1000*60*60*24))ASaverage_daily_slot_usageFROM`region-us-west1`.INFORMATION_SCHEMA.JOBSjobWHEREproject_id='my_project'ANDjob_typeISNULLANDSEARCH(job_id,'`search_index`')GROUPBYusage_dateORDERBYusage_dateDESClimit30;
When there are insufficient slots to run index-management jobs, an index canbecome out of sync with its table and indexing jobs might fail.In this case, BigQuery rebuilds the index from scratch. Toavoid having an out-of-sync index, ensure you have enough slots to support indexupdates from data ingestion and optimization. For more information onmonitoring slot usage, seeadmin resource charts.
Rebuild a vector index
Preview
This product or feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of theService Specific Terms. Pre-GA products and features are available "as is" and might have limited support. For more information, see thelaunch stage descriptions.
Note: To provide feedback or request support for this feature, send an email tobq-vector-search@google.com.When table data changes significantly after a vector index is created, thevector index can become less efficient. When a vector index is less efficient,a vector search query that initially had highrecallwhen using the index will have lower recall, because the data distributionshift in the base table isn't represented in the vector index.
If you want to improve recall without increasing search query latency, rebuildthe vector index. Alternatively, you can increase the value of the vectorsearch'sfraction_lists_to_search option to improve recall, but thistypically makes the search query slower. To find out when an index was lastbuilt, run the following query:
SELECTlast_model_build_timeFROMDATASET_NAME.INFORMATION_SCHEMA.VECTOR_INDEXESWHEREtable_name=TABLE_NAME;
You can use theVECTOR_INDEX.STATISTICS functionto calculate how much an indexed table's data has drifted between when avector index was created and the present. If table data has changed enoughto require a vector index rebuild, you can use theALTER VECTOR INDEX REBUILD statementto rebuild the vector index.
Follow these steps to rebuild a vector index:
Go to theBigQuery page.
In the query editor, run the following SQL statement to check the indexedtable's data drift:
SELECT*FROMVECTOR_INDEX.STATISTICS(TABLEDATASET_NAME.TABLE_NAME);
Replace the following:
DATASET_NAME: the name of the dataset thatcontains the indexed table.TABLE_NAME: the name of the table that containsthe vector index.
The function returns a
FLOAT64value in the range[0,1). A lower valueindicates less drift. Typically, a value of0.3or greater is considereda significant enough change to indicate that a vector index rebuild mightbe beneficial.If the
VECTOR_INDEX.STATISTICSfunction indicates that table data drift issignificant, run the following SQL statement to rebuild the vector index:ALTERVECTORINDEXIFEXISTSINDEX_NAMEONDATASET_NAME.TABLE_NAMEREBUILD;
Replace the following:
INDEX_NAME: the name of the vector index thatyou are rebuilding.DATASET_NAME: the name of the dataset thatcontains the indexed table.TABLE_NAME: the name of the table that containsthe vector index.
Monitor a rebuild
To monitor your index rebuild, query theINFORMATION_SCHEMA.VECTOR_INDEXES view.For example, the following query shows the rebuild status of all indexes ontables in the datasetmy_dataset in the projectmy_project. If an indexhas no ongoing rebuild, then the value forlast_index_alteration_info isNULL.
SELECTtable_name,index_name,last_index_alteration_info.statusASstatus,last_index_alteration_info.new_coverage_percentageAScoverageFROMmy_project.my_dataset.INFORMATION_SCHEMA.VECTOR_INDEXESThe result looks similar to the following:
+------------+------------+-------------+----------+| table_name | index_name | status | coverage |+------------+------------+-------------+----------+| table1 | index_a | IN_PROGRESS | 50 || table2 | index_b | null | null |+------------+------------+-------------+----------+Cancel a rebuild
To cancel a vector index rebuild, use theBQ.CANCEL_INDEX_ALTERATION system procedure.An index alteration operation might complete before the cancellation request isprocessed. To confirm whether the operation was successfully cancelled, querytheINFORMATION_SCHEMA.VECTOR_INDEXES view.If the cancellation wassuccessful, thelast_model_build_time field isn't updated and thelast_index_alteration_info field isn't present for that index.
Delete a vector index
When you no longer need a vector index or want to change which column isindexed on a table, you can delete the index on that table by using theDROP VECTOR INDEX DDL statement:
Go to theBigQuery page.
In the query editor, run the following SQL statement:
DROPVECTORINDEXINDEX_NAMEONDATASET_NAME.TABLE_NAME;
Replace the following:
INDEX_NAME: the name of the vector index thatyou are deleting.DATASET_NAME: the name of the dataset thatcontains the indexed table.TABLE_NAME: the name of the table that containsthe vector index.
If an indexed table is deleted, its index is deleted automatically.
Export embeddings to Vertex AI Vector Search
To enable ultra-low latency online applications, use BigQueryintegration with Vertex AIVector Searchto import your BigQuery embeddings into Vector Searchand deploy low latency endpoints. For more information, seeImport index data from BigQuery.
What's next
- For an overview of vector index use cases, pricing, and limitations, see theIntroduction to vector search.
- Learn how to perform a vector search using the
VECTOR_SEARCHfunction. - Learn how to perform semantic search using the
AI.SEARCHfunction. - Learn more about the
CREATE VECTOR INDEXstatement. - Try theSearch embeddings with vector searchtutorial.
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.