Text analysis functions
GoogleSQL for BigQuery supports the following text analysis functions.
Function list
| Name | Summary |
|---|---|
| BAG_OF_WORDS | Gets the frequency of each term (token) in a tokenized document. |
| TEXT_ANALYZE | Extracts terms (tokens) from text and converts them into a tokenized document. |
| TF_IDF | Evaluates how relevant a term (token) is to a tokenized document in a set of tokenized documents. |
BAG_OF_WORDS
```
BAG_OF_WORDS(tokenized_document)
```

Description
Gets the frequency of each term (token) in a tokenized document.
Definitions
`tokenized_document`: `ARRAY<STRING>` value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
Return type
ARRAY<STRUCT<term STRING, count INT64>>
Definitions:
- `term`: A unique term in the tokenized document.
- `count`: The number of times the term was found in the tokenized document.
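As a rough illustration of the described behavior, here is a minimal Python sketch (not BigQuery's implementation) that counts each term, treating `None` as the SQL `NULL` token. The ordering chosen here (NULL first, then by term) simply mirrors the example output below; the function itself doesn't document an ordering guarantee.

```python
from collections import Counter

def bag_of_words(tokenized_document):
    """Approximate BAG_OF_WORDS: count every term (including None for
    NULL) and return (term, count) pairs, None first, then by term."""
    counts = Counter(tokenized_document)
    return sorted(counts.items(),
                  key=lambda kv: (kv[0] is not None, kv[0] or ""))
```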
Examples
The following query produces terms and their frequencies in two tokenized documents:
```sql
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
  )
SELECT id, BAG_OF_WORDS(f) AS results
FROM ExampleTable
ORDER BY id;

/*----+------------------------------------------------+
 | id | results                                        |
 +----+------------------------------------------------+
 | 1  | [(null, 1), ('I', 1), ('like', 1), ('pie', 3)] |
 | 2  | [(null, 1), ('pie', 1), ('yum', 2)]            |
 +----+------------------------------------------------*/
```

TEXT_ANALYZE
```
TEXT_ANALYZE(
  text
  [, analyzer => { 'LOG_ANALYZER' | 'NO_OP_ANALYZER' | 'PATTERN_ANALYZER' }]
  [, analyzer_options => analyzer_options_values]
)
```

Description
Extracts terms (tokens) from text and converts them into a tokenized document.
Definitions
- `text`: `STRING` value that represents the input text to tokenize.
- `analyzer`: A named argument with a `STRING` value. Determines which analyzer to use to convert `text` into an array of terms (tokens). This can be:
  - `'LOG_ANALYZER'` (default): Breaks the input into terms when delimiters are encountered and then normalizes the terms. If `analyzer` isn't specified, this is used by default. For more information, see LOG_ANALYZER text analyzer.
  - `'NO_OP_ANALYZER'`: Extracts the text as a single term (token), but doesn't apply normalization. For more information, see NO_OP_ANALYZER text analyzer.
  - `'PATTERN_ANALYZER'`: Breaks the input into terms that match a regular expression. For more information, see PATTERN_ANALYZER text analyzer.
- `analyzer_options`: A named argument with a JSON-formatted `STRING` value. Takes a list of text analysis rules. For more information, see Text analyzer options.
Details
There is no guarantee on the order of the tokens produced by this function.
If no analyzer is specified, the `LOG_ANALYZER` analyzer is used by default.
Return type
ARRAY<STRING>
Examples
The following query uses the default text analyzer, `LOG_ANALYZER`, with the input text:
```sql
SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results

/*--------------------------------------------------------------------------+
 | results                                                                  |
 +--------------------------------------------------------------------------+
 | ['i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ] |
 +--------------------------------------------------------------------------*/
```

The following query uses the `NO_OP_ANALYZER` text analyzer with the input text:
```sql
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'NO_OP_ANALYZER'
) AS results

/*-----------------------------------------------+
 | results                                       |
 +-----------------------------------------------+
 | 'I like pie, you like-pie, they like 2 PIEs.' |
 +-----------------------------------------------*/
```

The following query uses the `PATTERN_ANALYZER` text analyzer with the input text:
```sql
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'PATTERN_ANALYZER'
) AS results

/*----------------------------------------------------------------+
 | results                                                        |
 +----------------------------------------------------------------+
 | ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ] |
 +----------------------------------------------------------------*/
```

For additional examples that include analyzer options, see Text analysis.
For helpful analyzer recipes that you can use to enhance analyzer-supported queries, see Search with text analyzers.
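As a rough illustration of what the default `LOG_ANALYZER` does in the first example above, the following Python sketch splits on delimiters (approximated here as any non-alphanumeric character) and lowercases each term. This is a simplified approximation for intuition only, not the actual analyzer, which has a configurable delimiter list and richer normalization rules.

```python
import re

def log_analyzer_approx(text):
    """Approximate LOG_ANALYZER: break the input into terms at
    non-alphanumeric delimiters, then normalize to lowercase."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())
```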
TF_IDF
```
TF_IDF(tokenized_document) OVER()
TF_IDF(tokenized_document, max_distinct_tokens) OVER()
TF_IDF(tokenized_document, max_distinct_tokens, frequency_threshold) OVER()
```

Description
Evaluates how relevant a term is to a tokenized document in a set of tokenized documents, using the TF-IDF (term frequency-inverse document frequency) algorithm.
Definitions
- `tokenized_document`: `ARRAY<STRING>` value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
- `max_distinct_tokens`: Optional argument. Takes a non-negative `INT64` value, which represents the size of the dictionary, excluding the unknown term. Terms are added to the dictionary until this threshold is met. So, if this value is 20, the first 20 unique terms are added and then no additional terms are added. If this argument isn't provided, the default value is 32000. If this argument is specified, the maximum value is 1048576.
- `frequency_threshold`: Optional argument. Takes a non-negative `INT64` value that represents the minimum number of times a term must appear in a tokenized document to be included in the dictionary. So, if this value is 3, a term must appear at least three times in the tokenized document to be added to the dictionary. If this argument isn't provided, the default value is 5.
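The dictionary rules above can be sketched in Python. This is a hypothetical helper, not BigQuery's internal implementation: it counts term occurrences across all documents (ignoring NULL), keeps terms that meet `frequency_threshold`, and caps the dictionary at `max_distinct_tokens` entries, ordering candidates by frequency. Terms that don't qualify fall into the unknown-term bucket.

```python
from collections import Counter

def build_dictionary(documents, max_distinct_tokens=32000, frequency_threshold=5):
    """Sketch of the TF_IDF dictionary rules: keep terms occurring at
    least frequency_threshold times, up to max_distinct_tokens terms.
    None (NULL) tokens are never dictionary terms."""
    counts = Counter(t for doc in documents for t in doc if t is not None)
    qualified = [t for t, c in counts.most_common() if c >= frequency_threshold]
    return qualified[:max_distinct_tokens]
```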
Details
This function uses a TF-IDF (term frequency-inverse document frequency)algorithm to compute the relevance of terms in a set of tokenized documents.TF-IDF multiplies two metrics: how many times a term appears in a document(term frequency), and the inverse document frequency of the term across acollection of documents (inverse document frequency).
TF-IDF:

```
term frequency * inverse document frequency
```

term frequency:

```
(count of term in document) / (document size)
```

inverse document frequency:

```
log(1 + document set size / (1 + count of documents containing term))
```
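The formulas above can be checked against the example output further down with a small Python sketch. This computes the score for a single in-dictionary term; NULL (`None`) tokens count toward document size, which matches the documented example values. Handling of the unknown term is omitted here.

```python
import math

def tf_idf_term(term, document, documents):
    """TF-IDF for one in-dictionary term, per the formulas above:
    tf  = (count of term in document) / (document size)
    idf = log(1 + document set size / (1 + docs containing term))"""
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in documents if term in doc)
    idf = math.log(1 + len(documents) / (1 + containing))
    return tf * idf
```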
Terms are added to a dictionary of terms if they satisfy the criteria for `max_distinct_tokens` and `frequency_threshold`, otherwise they are considered the unknown term. The unknown term is always the first term in the dictionary and represented as `NULL`. The rest of the dictionary is ordered by term frequency rather than alphabetically.
Return type
ARRAY<STRUCT<term STRING, tf_idf DOUBLE>>
Definitions:
- `term`: The unique term that was added to the dictionary.
- `tf_idf`: The TF-IDF computation for the term.
Examples
The following query computes the relevance of up to 10 terms that appear at least twice in a set of tokenized documents. In this example, 10 represents `max_distinct_tokens` and 2 represents `frequency_threshold`:
```sql
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
  )
SELECT id, TF_IDF(f, 10, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;

/*----+-------------------------------------------------+
 | id | results                                         |
 +----+-------------------------------------------------+
 | 1  | [{"index":null,"value":"0.1304033435859887"},   |
 |    |  {"index":"I","value":"0.1412163100645339"},    |
 |    |  {"index":"like","value":"0.1412163100645339"}, |
 |    |  {"index":"pie","value":"0.29389333245105953"}] |
 +----+-------------------------------------------------+
 | 2  | [{"index":null,"value":"0.1956050153789831"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}, |
 |    |  {"index":"yum","value":"0.4236489301936017"}]  |
 +----+-------------------------------------------------+
 | 3  | [{"index":null,"value":"0.1956050153789831"},   |
 |    |  {"index":"I","value":"0.21182446509680086"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}, |
 |    |  {"index":"yum","value":"0.21182446509680086"}] |
 +----+-------------------------------------------------+
 | 4  | [{"index":null,"value":"0.4694520369095594"},   |
 |    |  {"index":"like","value":"0.1694595720774407"}, |
 |    |  {"index":"pie","value":"0.11755733298042381"}] |
 +----+-------------------------------------------------*/
```

The following query computes the relevance of up to three terms that appear at least twice in a set of tokenized documents. In this example, 3 represents `max_distinct_tokens` and 2 represents `frequency_threshold`:
```sql
WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', 'too', NULL] AS f
  )
SELECT id, TF_IDF(f, 3, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;

/*----+-------------------------------------------------+
 | id | results                                         |
 +----+-------------------------------------------------+
 | 1  | [{"index":null,"value":"0.12679902142647365"},  |
 |    |  {"index":"I","value":"0.1412163100645339"},    |
 |    |  {"index":"like","value":"0.1412163100645339"}, |
 |    |  {"index":"pie","value":"0.29389333245105953"}] |
 +----+-------------------------------------------------+
 | 2  | [{"index":null,"value":"0.5705955964191315"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}] |
 +----+-------------------------------------------------+
 | 3  | [{"index":null,"value":"0.380397064279421"},    |
 |    |  {"index":"I","value":"0.21182446509680086"},   |
 |    |  {"index":"pie","value":"0.14694666622552977"}] |
 +----+-------------------------------------------------+
 | 4  | [{"index":null,"value":"0.45647647713530515"},  |
 |    |  {"index":"like","value":"0.1694595720774407"}, |
 |    |  {"index":"pie","value":"0.11755733298042381"}] |
 +----+-------------------------------------------------*/
```
Last updated 2025-11-24 UTC.