Text analysis functions

GoogleSQL for BigQuery supports the following text analysis functions.

Function list

  • BAG_OF_WORDS: Gets the frequency of each term (token) in a tokenized document.
  • TEXT_ANALYZE: Extracts terms (tokens) from text and converts them into a tokenized document.
  • TF_IDF: Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents.

BAG_OF_WORDS

BAG_OF_WORDS(tokenized_document)

Description

Gets the frequency of each term (token) in a tokenized document.

Definitions

  • tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.

Return type

ARRAY<STRUCT<term STRING, count INT64>>

Definitions:

  • term: A unique term in the tokenized document.
  • count: The number of times the term was found in the tokenized document.
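The behavior of BAG_OF_WORDS on a single tokenized document can be sketched in Python. This is an illustrative sketch only: None models a SQL NULL element, and the output ordering mirrors the example below (null term first, remaining terms sorted):

```python
from collections import Counter

def bag_of_words(tokenized_document):
    """Count the frequency of each term in a tokenized document.

    None models a SQL NULL element; all NULLs count as one term.
    """
    counts = Counter(tokenized_document)
    # Null term first, then the remaining terms in sorted order.
    return sorted(counts.items(), key=lambda kv: (kv[0] is not None, kv[0] or ""))

# → [(None, 1), ('I', 1), ('like', 1), ('pie', 3)]
print(bag_of_words(['I', 'like', 'pie', 'pie', 'pie', None]))
```

This matches the counts in the SQL example below for document 1.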

Examples

The following query produces terms and their frequencies in two tokenized documents:

WITH ExampleTable AS (
  SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
  SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
)
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;

/*----+------------------------------------------------+
 | id | results                                        |
 +----+------------------------------------------------+
 | 1  | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
 | 2  | [(null, 1), ('pie', 1), ('yum', 2)]            |
 +----+------------------------------------------------*/

TEXT_ANALYZE

TEXT_ANALYZE(
  text
  [, analyzer => { 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' }]
  [, analyzer_options => analyzer_options_values]
)

Description

Extracts terms (tokens) from text and converts them into a tokenized document.

Definitions

  • text: STRING value that represents the input text to tokenize.
  • analyzer: A named argument with a STRING value. Determines which analyzer to use to convert text into an array of terms (tokens). This can be:

    • 'LOG_ANALYZER' (default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. For more information, see LOG_ANALYZER text analyzer.

    • 'NO_OP_ANALYZER': Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.

    • 'PATTERN_ANALYZER': Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.

  • analyzer_options: A named argument with a JSON-formatted STRING value. Takes a list of text analysis rules. For more information, see Text analyzer options.

Details

There is no guarantee on the order of the tokens produced by this function.

If no analyzer is specified, the LOG_ANALYZER analyzer is used by default.
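For intuition, the LOG_ANALYZER example output shown below can be approximated in Python with a lowercase-and-split pass. This is a rough sketch only: the real analyzer has a configurable delimiter list and normalization rules, and this regex is just an illustration:

```python
import re

def log_analyzer_sketch(text):
    """Rough approximation of LOG_ANALYZER: lowercase the input and
    split on anything that is not a letter or digit."""
    return re.findall(r'[a-z0-9]+', text.lower())

# → ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies']
print(log_analyzer_sketch('I like pie, you like-pie, they like 2 PIEs.'))
```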

Return type

ARRAY<STRING>

Examples

The following query uses the default text analyzer, LOG_ANALYZER, with the input text:

SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results

/*--------------------------------------------------------------------------+
 | results                                                                  |
 +--------------------------------------------------------------------------+
 | ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
 +--------------------------------------------------------------------------*/

The following query uses the NO_OP_ANALYZER text analyzer with the input text:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'NO_OP_ANALYZER') AS results

/*-----------------------------------------------+
 | results                                       |
 +-----------------------------------------------+
 | 'I like pie, you like-pie, they like 2 PIEs.' |
 +-----------------------------------------------*/

The following query uses the PATTERN_ANALYZER text analyzer with the input text:

SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'PATTERN_ANALYZER') AS results

/*----------------------------------------------------------------+
 | results                                                        |
 +----------------------------------------------------------------+
 | ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
 +----------------------------------------------------------------*/

For additional examples that include analyzer options, see Text analysis.

For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see Search with text analyzers.

TF_IDF

TF_IDF(tokenized_document) OVER()
TF_IDF(tokenized_document, max_distinct_tokens) OVER()
TF_IDF(tokenized_document, max_distinct_tokens, frequency_threshold) OVER()

Description

Evaluates how relevant a term is to a tokenized document in a set of tokenized documents, using the TF-IDF (term frequency-inverse document frequency) algorithm.

Definitions

  • tokenized_document: ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
  • max_distinct_tokens: Optional argument. Takes a non-negative INT64 value, which represents the size of the dictionary, excluding the unknown term.

    Terms are added to the dictionary until this threshold is met. So, if this value is 20, the first 20 unique terms are added and then no additional terms are added.

    If this argument isn't provided, the default value is 32000. If this argument is specified, the maximum value is 1048576.

  • frequency_threshold: Optional argument. Takes a non-negative INT64 value that represents the minimum number of times a term must appear in a tokenized document to be included in the dictionary. So, if this value is 3, a term must appear at least three times in the tokenized document to be added to the dictionary.

    If this argument isn't provided, the default value is 5.

Details

This function uses a TF-IDF (term frequency-inverse document frequency) algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).

  • TF-IDF:

    term frequency * inverse document frequency

  • term frequency:

    (count of term in document) / (document size)

  • inverse document frequency:

    log(1 + document set size / (1 + count of documents containing term))
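The two formulas above can be checked with a short Python sketch. It scores every term and deliberately omits the dictionary rules described below (max_distinct_tokens, frequency_threshold, and the unknown-term bucket), so it reproduces only the per-term arithmetic; None models a NULL element:

```python
import math

def tf_idf_sketch(docs):
    """Score each term in each document using:
    tf  = count of term in document / document size
    idf = log(1 + document set size / (1 + count of documents containing term))
    Dictionary limits and the unknown-term bucket are intentionally omitted."""
    n = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for doc in docs:
        size = len(doc)
        scores.append({
            term: (doc.count(term) / size) * math.log(1 + n / (1 + doc_freq[term]))
            for term in set(doc)
        })
    return scores

docs = [
    ['I', 'like', 'pie', 'pie', 'pie', None],
    ['yum', 'yum', 'pie', None],
    ['I', 'yum', 'pie', None],
    ['you', 'like', 'pie', 'too', None],
]
results = tf_idf_sketch(docs)
# 'pie' in document 1: tf = 3/6, idf = log(1 + 4/5) → 0.29389333245105953
print(results[0]['pie'])
```

The scores for terms that make it into the dictionary match the SQL examples below (for instance, 'pie' in document 1 and 'yum' in document 2); the unknown-term (null) scores differ because this sketch does not pool excluded terms.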

Terms are added to a dictionary of terms if they satisfy the criteria for max_distinct_tokens and frequency_threshold, otherwise they are considered the unknown term. The unknown term is always the first term in the dictionary and is represented as NULL. The rest of the dictionary is ordered by term frequency rather than alphabetically.

Return type

ARRAY<STRUCT<term STRING, tf_idf DOUBLE>>

Definitions:

  • term: The unique term that was added to the dictionary.
  • tf_idf: The TF-IDF computation for the term.

Examples

The following query computes the relevance of up to 10 terms that appear at least twice in a set of tokenized documents. In this example, 10 represents max_distinct_tokens and 2 represents frequency_threshold:

WITH ExampleTable AS (
  SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
  SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 10, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;

/*----+-------------------------------------------------+
 | id | results                                         |
 +----+-------------------------------------------------+
 | 1  | [{"index":null,"value":"0.1304033435859887"},   |
 |    |  {"index":"I","value":"0.1412163100645339"},    |
 |    |  {"index":"like","value":"0.1412163100645339"}, |
 |    |  {"index":"pie","value":"0.29389333245105953"}] |
 +----+-------------------------------------------------+
 | 2  | [{"index":null,"value":"0.1956050153789831"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}, |
 |    |  {"index":"yum","value":"0.4236489301936017"}]  |
 +----+-------------------------------------------------+
 | 3  | [{"index":null,"value":"0.1956050153789831"},   |
 |    |  {"index":"I","value":"0.21182446509680086"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}, |
 |    |  {"index":"yum","value":"0.21182446509680086"}] |
 +----+-------------------------------------------------+
 | 4  | [{"index":null,"value":"0.4694520369095594"},   |
 |    |  {"index":"like","value":"0.1694595720774407"}, |
 |    |  {"index":"pie","value":"0.11755733298042381"}] |
 +----+-------------------------------------------------*/

The following query computes the relevance of up to three terms that appear at least twice in a set of tokenized documents. In this example, 3 represents max_distinct_tokens and 2 represents frequency_threshold:

WITH ExampleTable AS (
  SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
  SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
  SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
)
SELECT id, TF_IDF(f, 3, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;

/*----+-------------------------------------------------+
 | id | results                                         |
 +----+-------------------------------------------------+
 | 1  | [{"index":null,"value":"0.12679902142647365"},  |
 |    |  {"index":"I","value":"0.1412163100645339"},    |
 |    |  {"index":"like","value":"0.1412163100645339"}, |
 |    |  {"index":"pie","value":"0.29389333245105953"}] |
 +----+-------------------------------------------------+
 | 2  | [{"index":null,"value":"0.5705955964191315"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}] |
 +----+-------------------------------------------------+
 | 3  | [{"index":null,"value":"0.380397064279421"},    |
 |    |  {"index":"I","value":"0.21182446509680086"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}] |
 +----+-------------------------------------------------+
 | 4  | [{"index":null,"value":"0.45647647713530515"},  |
 |    |  {"index":"like","value":"0.1694595720774407"}, |
 |    |  {"index":"pie","value":"0.11755733298042381"}] |
 +----+-------------------------------------------------*/

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-11-24 UTC.