The AI.SIMILARITY function

Preview

This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section of the Service Specific Terms. Pre-GA features are available "as is" and might have limited support. For more information, see the launch stage descriptions.

Note: To provide feedback or request support for this feature during the preview, contact bqml-feedback@google.com.

This document describes the AI.SIMILARITY function, which lets you find the cosine similarity between two inputs. Values closer to 1 indicate more similar inputs, and values closer to 0 indicate less similar inputs.

# Returns 0.9166 because 'happy' and 'glad' are synonyms.
SELECT AI.SIMILARITY('happy', 'glad', endpoint => 'text-embedding-005');

Use cases include the following:

  • Semantic search: Search for text or images based on a description, without having to match specific keywords.
  • Recommendation: Return entities with attributes similar to a given entity.
  • Classification: Return the class of entities whose attributes are similar to the given entity.

The function works by creating embeddings for the inputs, and then determining the cosine similarity between the embeddings. The AI.SIMILARITY function sends requests to a stable Vertex AI embedding model. The function incurs charges in Vertex AI each time it's called.
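
For reference (general math, not specific to BigQuery), the cosine similarity of two embedding vectors $A$ and $B$ is their dot product divided by the product of their magnitudes:

$$\text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Embeddings that point in nearly the same direction score close to 1, while unrelated embeddings score close to 0.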

Input

You can use the AI.SIMILARITY function with data from BigQuery standard tables. AI.SIMILARITY accepts the following types of input:

  • Text: STRING values, such as string literals, STRING columns, or expressions that evaluate to strings.
  • Images: ObjectRef or ObjectRefRuntime values that reference objects stored in Cloud Storage.

When you analyze image data, the content must be in one of the supported image formats that are described in the Gemini API model mimeType parameter.

Syntax

Text embedding

AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
  [, connection_id => 'CONNECTION_ID']
)

Arguments

AI.SIMILARITY takes the following arguments:

  • CONTENT1: a STRING value that provides the first value to compare. The value of CONTENT1 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
  • CONTENT2: a STRING value that provides the second value to compare. The value of CONTENT2 can be a string literal, the name of a table column, or the output of an expression that evaluates to a string.
  • ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the text embedding model. If you specify the model name, such as 'text-embedding-005', rather than a URL, then BigQuery ML automatically identifies the model and uses the model's full endpoint.
  • MODEL_PARAMS: a JSON literal that provides additional parameters to the model. You can use any of the parameters object fields. One of these fields, outputDimensionality, lets you specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the outputDimensionality field, then the model returns a 256-dimensional embedding for each input value (see the sketch after this list).
  • CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

    For user-initiated queries, the CONNECTION_ID argument is optional. When a user initiates a query, BigQuery ML uses the credentials of the user who submitted the query to run it.

    If your query job is expected to run for 48 hours or longer, you should use the CONNECTION_ID argument to run the query using a service account.

    Replace the following:

    • PROJECT_ID: the project ID of the project that contains the connection.
    • LOCATION: the location used by the connection. The connection must be in the same region in which the query is run.
    • CONNECTION_ID: the connection ID, for example, myconnection.

      You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID. For example, projects/myproject/locations/connection_location/connections/myconnection.

    You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.
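
As a hedged sketch of how these arguments fit together, the following query combines the endpoint, model_params, and connection_id arguments. The connection name and the 256-dimensional setting are illustrative assumptions, not requirements:

# Requests 256-dimensional embeddings and runs the query through an explicit connection.
# The connection 'myproject.us.myconnection' is a hypothetical example.
SELECT AI.SIMILARITY(
  'happy',
  'glad',
  endpoint => 'text-embedding-005',
  model_params => JSON '{"outputDimensionality": 256}',
  connection_id => 'myproject.us.myconnection');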

Multimodal embedding

AI.SIMILARITY(
  content1 => 'CONTENT1',
  content2 => 'CONTENT2',
  connection_id => 'CONNECTION_ID',
  endpoint => 'ENDPOINT'
  [, model_params => MODEL_PARAMS]
)

Arguments

AI.SIMILARITY takes the following arguments:

  • CONTENT1: a STRING, ObjectRef, or ObjectRefRuntime value that provides the first value to compare.
    • For text embeddings, you can specify one of the following:
      • A string literal.
      • The name of a STRING column.
      • The output of an expression that evaluates to a string.
    • For image content embeddings, you can specify one of the following:
      • The name of an ObjectRef column.
      • An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
      • An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.

    ObjectRef values must have the details.gcs_metadata.content_type element of the JSON value populated.

    ObjectRefRuntime values must have the access_url.read_url and details.gcs_metadata.content_type elements of the JSON value populated.

  • CONTENT2: a STRING, ObjectRef, or ObjectRefRuntime value that provides the second value to compare.
    • For text embeddings, you can specify one of the following:
      • A string literal.
      • The name of a STRING column.
      • The output of an expression that evaluates to a string.
    • For image content embeddings, you can specify one of the following:
      • The name of an ObjectRef column.
      • An ObjectRef value generated by a combination of the OBJ.FETCH_METADATA and OBJ.MAKE_REF functions. For example, SELECT OBJ.FETCH_METADATA(OBJ.MAKE_REF("gs://mybucket/path/to/file.jpg", "us.connection1"));.
      • An ObjectRefRuntime value generated by the OBJ.GET_ACCESS_URL function.
  • CONNECTION_ID: a STRING value specifying the connection to use to communicate with the model, in the format PROJECT_ID.LOCATION.CONNECTION_ID. For example, myproject.us.myconnection.

    Replace the following:

    • PROJECT_ID: the project ID of the project that contains the connection.
    • LOCATION: the location used by the connection. The connection must be in the same location as the dataset that contains the model.
    • CONNECTION_ID: the connection ID, for example, myconnection.

      You can get this value by viewing the connection details in the Google Cloud console and copying the value in the last section of the fully qualified connection ID that is shown in Connection ID. For example, projects/myproject/locations/connection_location/connections/myconnection.

    You need to grant the Vertex AI User role to the connection's service account in the project where you run the function.

  • ENDPOINT: a STRING value that specifies the Vertex AI endpoint to use for the multimodal embedding model. If you specify the model name rather than a URL, BigQuery ML automatically identifies the model and uses the model's full endpoint.
  • MODEL_PARAMS: a JSON literal that provides additional parameters to the model. Only the dimension field is supported. You can use the dimension field to specify the number of dimensions to use when generating embeddings. For example, if you specify 256 for the dimension field, then the model returns a 256-dimensional embedding for each input value (see the sketch after this list).
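
The following hedged sketch shows these arguments together for a text-to-image comparison. The Cloud Storage path and the connection names are hypothetical placeholders; the endpoint and the ObjectRef construction follow the forms shown elsewhere in this document:

# Compares text against an image object using 256-dimensional multimodal embeddings.
# 'gs://mybucket/path/to/file.jpg', 'us.myconnection', and 'myproject.us.myconnection'
# are hypothetical examples.
SELECT AI.SIMILARITY(
  'aquarium device',
  OBJ.FETCH_METADATA(OBJ.MAKE_REF('gs://mybucket/path/to/file.jpg', 'us.myconnection')),
  connection_id => 'myproject.us.myconnection',
  endpoint => 'multimodalembedding@001',
  model_params => JSON '{"dimension": 256}');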

Output

AI.SIMILARITY returns a FLOAT64 value that represents the cosine similarity between the two inputs. If an error occurs, the function returns NULL.
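
Because errors surface as NULL values rather than failing the query, you might filter them out before ranking results. A minimal sketch, assuming a hypothetical mydataset.products table with name and description columns:

# Filters out rows where the function returned NULL (an error occurred for that row).
# mydataset.products and its columns are hypothetical examples.
SELECT *
FROM (
  SELECT
    name,
    AI.SIMILARITY(description, 'water filter', endpoint => 'text-embedding-005') AS similarity_score
  FROM mydataset.products)
WHERE similarity_score IS NOT NULL
ORDER BY similarity_score DESC;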

Examples

The following examples demonstrate how to use the AI.SIMILARITY function.

Compare text values

The following example queries publicly available BBC news articles and shows the top five articles most related to downward trends in the housing market:

SELECT
  title,
  body,
  AI.SIMILARITY(
    "housing market downward trends",
    body,
    endpoint => "text-embedding-005") AS similarity_score
FROM `bigquery-public-data.bbc_news.fulltext`
ORDER BY similarity_score DESC
LIMIT 5;

Compare text and ObjectRef values

The following query creates an external table from images of pet products stored in a publicly available Cloud Storage bucket. Then, it queries the table for images that are similar to the text "aquarium device." The results include images of products for maintaining water and air quality in an aquarium.

# Create a dataset
CREATE SCHEMA IF NOT EXISTS cymbal_pets;

# Create an object table
CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images
WITH CONNECTION DEFAULT
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']);

SELECT
  uri,
  STRING(OBJ.GET_ACCESS_URL(ref, 'r').access_urls.read_url) AS signed_url,
  AI.SIMILARITY(
    "aquarium device",
    ref,
    endpoint => 'multimodalembedding@001',
    connection_id => 'us.example_connection') AS similarity_score
FROM cymbal_pets.product_images
ORDER BY similarity_score DESC
LIMIT 3;

Related functions

The AI.SIMILARITY and VECTOR_SEARCH functions support overlapping use cases. In general, you should use AI.SIMILARITY when you want to perform a small number of comparisons and you haven't precomputed any embeddings. You should use VECTOR_SEARCH when performance is critical and you're working with a large number of embeddings. The following table summarizes the differences:

| Feature | AI.SIMILARITY | VECTOR_SEARCH |
|---|---|---|
| Function type | A scalar function that takes a single value as input and returns a single value as output. | A table-valued function (TVF) that takes a table as input and returns a table as output. |
| Primary purpose | Computes the semantic similarity score between two specific content inputs. | Finds the top-K closest embeddings from a base table or query to a given embedding. |
| Input | Typically two STRING or ObjectRef columns or values. | A query, a base table, and columns to search. Traditionally operates on pre-computed embedding columns of type ARRAY<FLOAT64>. |
| Output | A single FLOAT64 value per row, representing the cosine similarity between the inputs. | A table containing the nearest neighbor rows from the base table, and the distance or similarity scores depending on options passed to the function. |
| Embedding | Always generates embeddings for both content inputs at runtime using Vertex AI embedding LLMs. | Uses pre-computed embeddings. |
| Indexing | Does not use vector indexes. Performs a direct comparison, including the cost of two embedding generations per call. | Designed to use a vector index for efficient approximate nearest neighbor (ANN) search over large datasets. Can also perform brute-force search without an index. |
| Common use case | Calculating similarity between pairs of items in a row, such as in a SELECT or WHERE clause. Useful to prototype queries while iterating on a design. | Useful for semantic search, recommendation systems, and retrieval augmented generation (RAG) to find the best matches from a large corpus. |
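
As a rough illustration of the VECTOR_SEARCH side of this comparison, the following hedged sketch searches a pre-computed ARRAY<FLOAT64> embedding column. The dataset, table, and column names, and the literal query embedding, are hypothetical placeholders; in practice the query embedding would be generated with an embedding model and match the column's dimensionality:

# Finds the five nearest pre-computed embeddings to a query embedding.
# mydataset.articles, its embedding column, and the literal vector are hypothetical.
SELECT base.title, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.articles,
  'embedding',
  (SELECT [0.1, 0.2, 0.3] AS embedding),
  top_k => 5,
  distance_type => 'COSINE');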

Locations

You can run AI.SIMILARITY in all of the regions that support Gemini models, and also in the US and EU multi-regions.

Quotas

See Vertex AI and Cloud AI service functions quotas and limits.
