Autonomous embedding generation

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of theService Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see thelaunch stage descriptions.

Note: To give feedback or request support for this feature, contactbq-vector-search@google.com

This document describes how to use autonomous embedding generation for yourdata, which lets BigQuery maintain a column of embeddings on a tablebased on a source column. When you add or modify data in the source column,BigQuery automatically generates or updates the embeddingcolumn for that data by using a Vertex AI embedding model.This is helpful if you want to let BigQuery maintain yourembeddings when your source data is updated regularly.

Embeddings are useful for modern generative AIapplications such as Retrieval Augmented Generation (RAG), but they can becomplex to create, manage, and query. You can useautonomous embedding generation to simplify the process of creating,maintaining, and querying embeddings for use in similarity searches and othergenerative AI applications.

For example, you can use queries similar to the following tocreate a table with autonomous embedding generationenabled, insert data, and then perform semantic search:

CREATETABLEmydataset.products(nameSTRING,descriptionSTRING,description_embeddingSTRUCT<resultARRAY<FLOAT64>,statusSTRING>GENERATEDALWAYSAS(AI.EMBED(description,connection_id=>'us.example_connection',endpoint=>'text-embedding-005'))STOREDOPTIONS(asynchronous=TRUE));# Values in the description_embedding column are automatically generated.INSERTINTOmydataset.products(name,description)VALUES('Super slingers','An exciting board game for the whole family'),...;SELECT*FROMAI.SEARCH(TABLEmydataset.products,'description','A really fun toy');

Before you begin

To enable autonomous embedding generation on a table, you must have thenecessary permissions and connection.

Required roles

To get the permissions that you need to enable autonomous embedding generation, ask your administrator to grant you the following IAM roles:

  • To use a connection resource:BigQuery Connections User (roles/bigquery.connectionUser) on the connection
  • To create a table:BigQuery Data Editor (roles/bigquery.dataEditor) on the table
  • Grant the connection's service account the following role so that it can access models hosted in Vertex AI endpoints:Vertex AI User (roles/aiplatform.user) on the project that has the connection

For more information about granting roles, seeManage access to projects, folders, and organizations.

You might also be able to get the required permissions throughcustom roles or otherpredefined roles.

Create a connection and grant permission to a service account

To enable autonomous embedding generation on a table, you mustcreate a Cloud resource connection.Then,granttheVertex AI User role(roles/aiplatform.user) to the service account that was created when youcreated the connection.

Create a table with autonomous embedding generation

You can use autonomous embedding generation to generate embeddings by usingtheAI.EMBED functionin aCREATE TABLE statement.

CREATE TABLEDATASET_ID.TABLE (  [COLUMN, ...]STRING_COL STRING,EMBEDDING_COL_NAME STRUCT<result ARRAY<FLOAT64>, status STRING>    GENERATED ALWAYS AS (      AI.EMBED(STRING_COL,        connection_id =>CONNECTION_ID,        endpoint =>ENDPOINT)    )    STORED OPTIONS (asynchronous = TRUE));

Replace the following:

  • DATASET_ID: The name of the dataset in which you want to create the table.
  • TABLE: The name of the table on which to create autonomous embeddinggeneration.
  • COLUMN, ...: Any columns that your table should contain besides the columnthat you want to automatically embed.
  • STRING_COL: The name of theSTRING column that you want to automaticallyembed.
  • EMBEDDING_COL_NAME: The name of the automatically generated embeddingcolumn.
  • CONNECTION_ID: ASTRING value that contains the name of a connectionto use, such asmy_project.us.example_connection. You must grant theVertex AI Userrole to the connection's service account in the project in which youcreate the table.
  • ENDPOINT: aSTRING value that specifies asupported Vertex AItext embedding modelendpoint to use for the text embedding model. The endpointvalue that you specify must include the model version, for exampletext-embedding-005. If you specify the model name rather than aURL, BigQuery ML automatically identifies the model and usesthe model's full endpoint.

The background embedding generation job starts shortly after your table iscreated, or after you update data in the source column.

To track the progress of the embedding generation, you can use a querysimilar to the following:

SELECTCOUNT(*)AStotal_num_rows,COUNTIF(description_embeddingISNOTNULLANDdescription_embedding.status='')AStotal_num_generated_embeddingsFROMPROJECT_ID.DATASET_ID.TABLE;

After you create the table with embeddings, you cancreate a vector indexon theSTRUCT column that contains the automatically generated embedding.

Example

Suppose you are a large retailer that sells many different products. You have atable of product names and descriptions and you want to help your customersfind the products they're looking for. The following queries show you how toset up autonomous embedding generation to assist with semantic search of yourproduct descriptions.

First, create a dataset:

CREATESCHEMAmydataset;

Next, create a table with autonomous embedding generation enabled to hold yourproduct information. The automatically generated column is calleddescription_embedding and it's based on thedescription column.

# Create a table of products and descriptions with a generated embedding column.CREATETABLEmydataset.products(nameSTRING,descriptionSTRING,description_embeddingSTRUCT<resultARRAY<FLOAT64>,statusSTRING>GENERATEDALWAYSAS(AI.EMBED(description,connection_id=>'us.example_connection',endpoint=>'text-embedding-005'))STOREDOPTIONS(asynchronous=TRUE));

The following query inserts some product names and descriptions into the table.You don't specify a value fordescription_embedding because it's generatedautomatically.

# Insert product descriptions into the table.# The description_embedding column is automatically updated.INSERTINTOmydataset.products(name,description)VALUES("Lounger chair","A comfortable chair for relaxing in."),("Super slingers","An exciting board game for the whole family."),("Encyclopedia set","A collection of informational books.");

You can optionally create a vector index on the table to speed up searching.A vector index requires more than three rows, so the following query assumesthat you have inserted additional data. Every time you insert data, thedescription_embedding column is automatically updated.

CREATEVECTORINDEXmy_indexONmydataset.products(description_embedding)OPTIONS(index_type='IVF');

Finally, you can use theAI.SEARCH functionto perform semantic search on your products for a fun toy:

# Search for products that are fun to play with.SELECTbase.name,base.description,distanceFROMAI.SEARCH(TABLEmydataset.products,'description',"A really fun toy");/*------------------+----------------------------------------------+----------------------+ | name             | description                                  | distance             | +------------------+----------------------------------------------+----------------------+ | Super slingers   | An exciting board game for the whole family. | 0.80954913893618929  | | Lounger chair    | A comfortable chair for relaxing in.         | 0.938933930620146    | | Encyclopedia set | A collection of informational books.         | 1.1119297739353384   | +------------------+----------------------------------------------+----------------------*/

Get information about automatically generated embedding columns

To verify that a column is an automatically generated embedding column, querytheINFORMATION_SCHEMA.COLUMNS view.

The following query shows you information about all of your automaticallygenerated embedding columns:

SELECT*FROMPROJECT_ID.DATASET_ID.INFORMATION_SCHEMA.COLUMNSWHEREis_generated='ALWAYS';

Thegeneration_expression field shows you the call to theAI.EMBED functionthat is used to generate the embeddings on the column.

Use your own reservation

By default, BigQuery uses on-demand slots to handle theprocessing required to maintain the generated embedding column. To ensurepredictable and consistent performance, you canoptionallycreate a reservationand set thejob_type toBACKGROUND. When a background reservation ispresent, BigQuery uses it to maintain the generated embeddingcolumn instead.

Troubleshooting

The generated embedding column contains two fields:result andstatus.If an error occurs when BigQuery tries to generate an embeddingfor a particular row in your table, then theresult field isNULL and thestatus field describes the error. For example, if the source column isNULLthen theresult embedding is alsoNULL and the status isNULL value is not supported for embedding generation.

A more severe error can stall embedding generation. In this case, you canquery theINFORMATION_SCHEMA.JOBS viewfor the background job and look at the information in theerror_result field.The job ID of a background embedding job is prefixed withgc_. For example,the following query extracts all background jobs whose error result isn'tNULL:

SELECT*FROM`region-REGION.INFORMATION_SCHEMA.JOBS`jWHEREEXISTS(SELECT1FROMunnest(j.referenced_tables)tWHEREj.project_id='PROJECT_ID'ANDt.dataset_id='DATASET_ID'ANDt.table_id='TABLE')ANDstarts_with(job_id,'gc')ANDerror_resultISNOTNULLORDERBYj.creation_timeDESC;

Track costs

Autonomous embedding generation sends requests toVertex AI, which can incur costs. To track the Vertex AIcosts incurred by background embedding jobs, follow these steps:

  1. View your billing reports inCloud Billing.
  2. Use filters to refine your results.

    For services, selectVertex AI.

  3. To see the charges for a specific job,filter by label.

    Set the key tobigquery_ml_job and the value to thejob ID of the embedding job.Background embedding jobs all have a prefix ofgc_.

It can take up to 24 hours for some charges to appear in Cloud Billing.

Limitations

  • Each table supports at most one automatically generated embedding column.
  • There is no indication that a column is automatically generated when youview a table's schema using the Google Cloud console, thebq show command,or theddl field of theINFORMATION_SCHEMA.TABLES view.
  • You can't add generated embedding columns to an existing table by usingALTER TABLE ADD COLUMN.
  • If you create a copy, clone, or snapshot of a table that has a generatedembedding column, only the data is copied. The generation configurationdoesn't apply to the new table, and updates to the source column of thenew table won't result in new embeddings.
  • If you restore a table that had autonomous embedding generation enabledfrom a snapshot, the embedding generation configuration isn't restored.
  • You can create generated embedding columns only by using SQL. You can't usethebq mk orbq update commands to create generated embedding columns.
  • The source column of the generated column must be aSTRING column.
  • After you create the generated embedding column, the followinglimitations apply:

    • You can't drop or rename the source column, but you can stilldrop or rename the generated embedding column. If you drop theembedding column, then you can drop or rename the source column.
    • You can't change the data type of the source column or generatedembedding column.
  • You can't specify default values for automaticallygenerated embedding columns.

  • You can't directly write to generated embedding columns by using thesemethods:

    • DML
    • Streaming writes
    • bq insert
    • bq copy -a
  • Tables with generated embedding columns don't support anycolumn-level security policies, such as policy tags.

  • When you call a search function, such asVECTOR_SEARCHorAI.SEARCH,rows with missing embeddings in the base table are skipped during thesearch.

  • You can't create a partitioned vector index on a table that has autonomousembedding generation enabled.

  • If you create a vector index on the automatically generated embeddingcolumn, then index training starts after at least 80% of the rowshave generated embeddings. You can use the following query to checkwhat percentage of embeddings on your table have been generated:

    SELECTCOUNTIF(description_embeddingISNOTNULLANDdescription_embedding.status='')*100.0/COUNT(*)ASpercentFROMPROJECT_ID.DATASET_ID.TABLE;

What's next

Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-15 UTC.