The AI.SIMILARITY function
Preview
This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

Note: To provide feedback or request support for this feature during the preview, contact bqml-feedback@google.com.

This document describes the AI.SIMILARITY function, which lets you find the cosine similarity between two inputs. Values closer to 1 indicate more similar inputs, and values closer to 0 indicate less similar inputs.
```sql
# Returns 0.9166 because 'happy' and 'glad' are synonyms.
SELECT AI.SIMILARITY('happy', 'glad', endpoint => 'text-embedding-005');
```

Use cases include the following:
- Semantic search: Search for text or images based on a description, without having to match specific keywords.
- Recommendation: Return entities with attributes similar to a given entity.
- Classification: Return the class of entities whose attributes are similar to those of the given entity.
The function works by creating embeddings for the inputs, and then determining the cosine similarity between the embeddings. The AI.SIMILARITY function sends requests to a stable Vertex AI embedding model. The function incurs charges in Vertex AI each time it's called.
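Conceptually, if the two inputs are embedded as vectors $a$ and $b$, the returned score is the standard cosine similarity:

$$\mathrm{similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$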
Input
You can use the AI.SIMILARITY function with data from BigQuery standard tables. AI.SIMILARITY accepts the following types of input:
- Text data.
- Image data represented by ObjectRef values. (Preview)
- Image data represented by ObjectRefRuntime values. ObjectRefRuntime values are generated by the OBJ.GET_ACCESS_URL function. You can use ObjectRef values from standard tables as input to the OBJ.GET_ACCESS_URL function. (Preview)
When you analyze image data, the content must be in one of the supported image formats that are described in the Gemini API model mimeType parameter.
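For example, an ObjectRefRuntime input can be produced inline from an ObjectRef column. The following is a minimal sketch, assuming a hypothetical object table mydataset.images with an ObjectRef column named ref and a connection named us.example_connection:

```sql
-- A sketch, not a definitive example: mydataset.images, ref, and
-- us.example_connection are hypothetical names.
SELECT
  AI.SIMILARITY(
    'aquarium device',
    OBJ.GET_ACCESS_URL(ref, 'r'),  -- produces an ObjectRefRuntime value
    endpoint => 'multimodalembedding@001',
    connection_id => 'us.example_connection') AS similarity_score
FROM mydataset.images;
```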
Syntax
Text embedding
```sql
AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
  [, connection_id => 'CONNECTION_ID']
)
```
Arguments
AI.SIMILARITY takes the following arguments:
- CONTENT1: a STRING value that provides the first value to compare. The value of CONTENT1 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
- CONTENT2: a STRING value that provides the second value to compare. The value of CONTENT2 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
- ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the text embedding model. If you specify the model name, such as 'text-embedding-005', rather than a URL, then BigQuery ML automatically identifies the model and uses the model's full endpoint.
- MODEL_PARAMS: a JSON literal that provides additional parameters to the model. You can use any of the parameters object fields. One of these fields, outputDimensionality, lets you specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the outputDimensionality field, then the model returns a 256-dimensional embedding for each input value.
- CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

  For user-initiated queries, the connection_id argument is optional. When a user initiates a query, BigQuery ML uses the credentials of the user who submitted the query to run it. If your query job is expected to run for 48 hours or longer, you should use the connection_id argument to run the query using a service account.

  Replace the following:

  - PROJECT_ID: the project ID of the project that contains the connection.
  - LOCATION: the location used by the connection. The connection must be in the same region in which the query is run.
  - CONNECTION_ID: the connection ID, for example, myconnection. You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.
You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.
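For example, a call that reduces the embedding dimensionality through model_params might look like the following sketch:

```sql
-- A sketch: asks the model for 256-dimensional embeddings before comparing.
SELECT AI.SIMILARITY(
  'happy',
  'glad',
  endpoint => 'text-embedding-005',
  model_params => JSON '{"outputDimensionality": 256}');
```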
Multimodal embedding
```sql
AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  connection_id => 'CONNECTION_ID',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
)
```
Arguments
AI.SIMILARITY takes the following arguments:
- CONTENT1: a STRING, ObjectRef, or ObjectRefRuntime value that provides the first value to compare.

  For text embeddings, you can specify one of the following:

  - A string literal.
  - The name of a STRING column.
  - The output of an expression that evaluates to a string.

  For image content embeddings, you can specify one of the following:

  - The name of an ObjectRef column.
  - An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
  - An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.

  ObjectRef values must have the details.gcs_metadata.content_type elements of the JSON value populated. ObjectRefRuntime values must have the access_url.read_url and details.gcs_metadata.content_type elements of the JSON value populated.
- CONTENT2: a STRING, ObjectRef, or ObjectRefRuntime value that provides the second value to compare.

  For text embeddings, you can specify one of the following:

  - A string literal.
  - The name of a STRING column.
  - The output of an expression that evaluates to a string.

  For image content embeddings, you can specify one of the following:

  - The name of an ObjectRef column.
  - An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
  - An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.
- CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

  Replace the following:

  - PROJECT_ID: the project ID of the project that contains the connection.
  - LOCATION: the location used by the connection. The connection must be in the same location as the dataset that contains the model.
  - CONNECTION_ID: the connection ID, for example, myconnection. You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID, for example, projects/myproject/locations/connection_location/connections/myconnection.

  You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.
- ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the multimodal embedding model. If you specify the model name rather than a URL, BigQuery ML automatically identifies the model and uses the model's full endpoint.
- MODEL_PARAMS: a JSON literal that provides additional parameters to the model. Only the dimension field is supported. You can use the dimension field to specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the dimension field, then the model returns a 256-dimensional embedding for each input value.
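For example, a multimodal call that sets the dimension field might look like the following sketch, assuming a hypothetical object table mydataset.my_images with an ObjectRef column named ref:

```sql
-- A sketch: mydataset.my_images, ref, and us.example_connection are
-- hypothetical names.
SELECT
  AI.SIMILARITY(
    'a device for an aquarium',
    ref,
    connection_id => 'us.example_connection',
    endpoint => 'multimodalembedding@001',
    model_params => JSON '{"dimension": 256}') AS similarity_score
FROM mydataset.my_images;
```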
Output
AI.SIMILARITY returns a FLOAT64 value that represents the cosine similarity between the two inputs. If an error occurs, the function returns NULL.
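Because errors surface as NULL rather than failing the query, you might want to filter out failed rows when ranking. A minimal sketch:

```sql
-- A sketch: drops rows where the similarity computation returned NULL.
SELECT *
FROM (
  SELECT
    body,
    AI.SIMILARITY('housing market', body,
                  endpoint => 'text-embedding-005') AS score
  FROM `bigquery-public-data.bbc_news.fulltext`
)
WHERE score IS NOT NULL
ORDER BY score DESC
LIMIT 5;
```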
Examples
The following examples demonstrate how to use the AI.SIMILARITY function.
Compare text values
The following example queries publicly available BBC news articles and shows the top five articles most related to downward trends in the housing market:
```sql
SELECT
  title,
  body,
  AI.SIMILARITY(
    "housing market downward trends",
    body,
    endpoint => "text-embedding-005") AS similarity_score
FROM `bigquery-public-data.bbc_news.fulltext`
ORDER BY similarity_score DESC
LIMIT 5;
```
Compare text and ObjectRef values
The following query creates an external table from images of pet products stored in a publicly available Cloud Storage bucket. Then, it queries the table for images that are similar to the text "aquarium device." The results include images of products for maintaining water and air quality in an aquarium.
```sql
# Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;

# Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
  WITH CONNECTION DEFAULT
  OPTIONS (
    object_metadata = 'SIMPLE',
    uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']);

SELECT
  uri,
  STRING(OBJ.GET_ACCESS_URL(ref, 'r').access_urls.read_url) AS signed_url,
  AI.SIMILARITY(
    "aquarium device",
    ref,
    endpoint => 'multimodalembedding@001',
    connection_id => 'us.example_connection') AS similarity_score
FROM cymbal_pets.product_images
ORDER BY similarity_score DESC
LIMIT 3;
```
Related functions
The AI.SIMILARITY and VECTOR_SEARCH functions support overlapping use cases. In general, you should use AI.SIMILARITY when you want to perform a small number of comparisons and you haven't precomputed any embeddings. You should use VECTOR_SEARCH when performance is critical and you're working with a large number of embeddings. The following table summarizes the differences:
| Feature | AI.SIMILARITY | VECTOR_SEARCH |
|---|---|---|
| Function type | A scalar function that takes a single value as input and returns a single value as output. | A table-valued function (TVF) that takes a table as input and returns a table as output. |
| Primary purpose | Computes the semantic similarity score between two specific content inputs. | Finds the top-K closest embeddings from a base table or query to a given embedding. |
| Input | Typically two STRING or ObjectRef columns or values. | A query, a base table, and columns to search. Typically operates on pre-computed embedding columns of type ARRAY<FLOAT64>. |
| Output | A single FLOAT64 value per row, representing the cosine similarity between the inputs. | A table containing the nearest neighbor rows from the base table, and the distance or similarity scores depending on options passed to the function. |
| Embedding | Always generates embeddings for both content inputs at runtime using Vertex AI embedding LLMs. | Uses pre-computed embeddings. |
| Indexing | Does not use vector indexes. Performs a direct comparison, including the cost of two embedding generations per call. | Designed to use a vector index for efficient approximate nearest neighbor (ANN) search over large datasets. Can also perform brute-force search without an index. |
| Common Use Case | Calculating similarity between pairs of items in a row, such as in a SELECT or WHERE clause. Useful to prototype queries while iterating on a design. | Useful for semantic search, recommendation systems, and retrieval augmented generation (RAG) to find the best matches from a large corpus. |
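To illustrate the contrast, the following sketch shows the VECTOR_SEARCH pattern over pre-computed embeddings. The table and column names are hypothetical, and the full set of options is described in the VECTOR_SEARCH documentation:

```sql
-- A sketch: mydataset.articles is assumed to have a pre-computed
-- ARRAY<FLOAT64> column named embedding; a vector index on that column
-- enables approximate nearest neighbor search.
SELECT base.title, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.articles,
  'embedding',
  (SELECT embedding FROM mydataset.query_embeddings LIMIT 1),
  top_k => 5);
```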
Locations
You can run AI.SIMILARITY in all of the regions that support Gemini models, and also in the US and EU multi-regions.
Quotas
See Vertex AI and Cloud AI service functions quotas and limits.
What's next
- For more information about using Vertex AI models to generate text and embeddings, see Generative AI overview.
- For more information about using Cloud AI APIs to perform AI tasks, see AI application overview.