Text analysis
This document provides an overview of text analysis, also known as text mining, in GoogleSQL for BigQuery.
GoogleSQL supports text analysis, a technique that you can use to identify terms (tokens) in unstructured text and then use those terms for actionable insight, such as indexing and searching, or as inputs for vectorizations to be used in ML training pipelines. You can use a text analyzer to analyze information in a specific way and text analysis options to apply your own customizations to the analysis.
Text analysis is supported in GoogleSQL functions and statements such as the TEXT_ANALYZE function, the SEARCH function, and the CREATE SEARCH INDEX statement.
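For example, an analyzer can be attached to a search index when the index is created. The following is a minimal sketch, assuming that the CREATE SEARCH INDEX statement accepts the same analyzer and analyzer_options values used with TEXT_ANALYZE throughout this page; the dataset, table, column, and stop words are hypothetical placeholders:

-- A sketch only: my_dataset, my_table, and my_text_column are placeholder
-- names, and the stop words are arbitrary examples.
CREATE SEARCH INDEX my_index
ON my_dataset.my_table(my_text_column)
OPTIONS (
  analyzer = 'LOG_ANALYZER',
  analyzer_options = '{"token_filters": [{"stop_words": ["the", "of"]}]}'
);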
Text analyzers
GoogleSQL for BigQuery supports several types of text analyzers, which you can use to extract data from unstructured text. You can pass an analyzer into some functions and statements with the analyzer argument. Each text analyzer has a unique way of extracting information. Your choices are:
- NO_OP_ANALYZER: Extracts the input as a single term (token).
- LOG_ANALYZER: Breaks the input into terms when delimiters are encountered.
- PATTERN_ANALYZER: Breaks the input into terms that match a regular expression.
NO_OP_ANALYZER analyzer
The NO_OP_ANALYZER analyzer is a no-operation analyzer, which extracts the input text as a single term (token). No formatting is applied to the resulting term.
This analyzer doesn't support any text analyzer options or token filters.
Example
The following query uses NO_OP_ANALYZER as the text analyzer:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'NO_OP_ANALYZER'
) AS results

/*-----------------------------------------------+
 | results                                       |
 +-----------------------------------------------+
 | 'I like pie, you like-pie, they like 2 PIEs.' |
 +-----------------------------------------------*/

LOG_ANALYZER analyzer
The LOG_ANALYZER analyzer extracts the input text as terms (tokens) when a delimiter is encountered, discards the delimiters, and then changes any uppercase letters to lowercase letters in the results.
Details:
- An uppercase letter in a term is made lowercase, but ASCII values greater than 127 are kept as is (see the sketch after this list).
- Text is split into individual terms when one of the following delimiters, such as a space, period, or another non-letter character, is encountered:

  [ ] < > ( ) { } | ! ; , ' " * & ? + / : = @ . - $ % \ _ \n \r \s \t
  %21 %26 %2526 %3B %3b %7C %7c %20 %2B %2b %3D %3d %2520 %5D %5d %5B %5b
  %3A %3a %0A %0a %2C %2c %28 %29

  If you don't want to use these default delimiters, you can specify the specific delimiters that you want to use as text analyzer options. For more information, see the delimiters option.
- This analyzer supports token filters. For more information, see the token_filters option. If the token_filters option isn't specified, ASCII lowercase normalization is used by default.
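As a sketch of the first rule in this list, the following query uses non-ASCII input; the expected output is derived from that rule rather than independently verified:

SELECT TEXT_ANALYZE(
  'Ño PIEs',
  analyzer => 'LOG_ANALYZER'
) AS results

-- 'PIEs' is ASCII-lowercased to 'pies', but 'Ñ' is above ASCII 127 and is
-- kept as is. Expected results: ['Ño', 'pies']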
Example
The following query uses LOG_ANALYZER as the text analyzer:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'LOG_ANALYZER'
) AS results

/*----------------------------------------------------------------------------+
 | results                                                                     |
 +-----------------------------------------------------------------------------+
 | [ 'i', 'like', 'pie', 'you', 'like', 'pie', 'they', 'like', '2', 'pies' ]   |
 +-----------------------------------------------------------------------------*/

Because LOG_ANALYZER is the default text analyzer, you don't need to specify it in the query. For example, the following query produces the same results as the preceding query:
SELECT TEXT_ANALYZE('I like pie, you like-pie, they like 2 PIEs.') AS results

PATTERN_ANALYZER analyzer
The PATTERN_ANALYZER analyzer extracts terms (tokens) from unstructured text, using a re2 regular expression.
This analyzer finds the first term from the left side of the input text that matches the regular expression and adds this term to the output. Then, it removes the prefix in the input text up to the newly found term. This process is repeated until the input text is empty.
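To sketch this process, the following query uses a hypothetical pattern that matches runs of digits; the expected output is derived from the matching rule above rather than independently verified:

SELECT TEXT_ANALYZE(
  'ab12cd345',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '{"patterns": ["[0-9]+"]}'
) AS results

-- The first match from the left is '12', so the prefix 'ab12' is removed
-- from the input; the next match is '345'.
-- Expected results: ['12', '345']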
By default, the regular expression \b\w{2,}\b is used. This regular expression matches non-Unicode words that have at least two characters. If you would like to use another regular expression, see the patterns option.
This analyzer supports token filters. For more information, see the token_filters option. If the token_filters option isn't specified, ASCII lowercase normalization is used by default.
Example
The following query uses PATTERN_ANALYZER as the text analyzer. Because the default regular expression is used, only words that have two or more characters are included as terms. Also, the results are lowercase. Notice that i and 2 don't appear in the results.
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'PATTERN_ANALYZER'
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ]   |
 +------------------------------------------------------------------*/

Text analyzer options
Text analyzers support custom options that determine how input text is analyzed. You can pass analyzer options into some functions and statements with the analyzer_options argument. This argument takes a JSON-formatted STRING value.
Your choices are:
- delimiters: Breaks the input into terms when these delimiters are encountered.
- patterns: Breaks the input into terms that match a regular expression.
- token_filters: After the input text has been tokenized into terms, applies filters on the terms.
delimiters analyzer option
'{ "delimiters":array_of_delimiters}'Description
If you are using the LOG_ANALYZER text analyzer and you don't want to use the default delimiters, you can specify the specific delimiters that you want to use to filter the input text.
Definitions
delimiters: A JSON array containing strings that represent the delimiters to use to tokenize the input text.
Details
When there are two delimiters with the same prefix, for example % and %2, the longer delimiter has higher precedence and is analyzed first.
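For example, the following sketch declares both % and %2 as delimiters; the expected output is derived from the precedence rule above rather than independently verified:

SELECT TEXT_ANALYZE(
  'a%2b%c',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '{"delimiters": ["%", "%2"]}'
) AS results

-- Because '%2' is longer, it is analyzed before '%', so the input splits
-- as 'a', 'b', 'c' rather than 'a', '2b', 'c'.
-- Expected results: ['a', 'b', 'c']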
You can add any ASCII string as a delimiter. The length of a delimiter must be less than or equal to 16 characters. Some common delimiters that you might want to include are:
[ ] < > ( ) { } | ! ; , ' " * & ? + / : = @ . - $ % \ _ \n \r \s \t
%21 %26 %2526 %3B %3b %7C %7c %20 %2B %2b %3D %3d %2520 %5D %5d %5B %5b
%3A %3a %0A %0a %2C %2c %28 %29

Example
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '{"delimiters": [",", ".", "-"]}'
) AS results

/*-------------------------------------------------------+
 | results                                                |
 +--------------------------------------------------------+
 | ['i like pie', 'you like', 'pie', 'they like 2 pies']  |
 +--------------------------------------------------------*/

patterns analyzer option
'{ "patterns":array_of_regex_patterns}'Description
If you are using the PATTERN_ANALYZER text analyzer and you don't want to use the default regular expression, you can specify the regular expression that you want to use to filter the input text.
Definitions
patterns: A JSON array containing one string that represents the regular expression.
Details
If this analyzer option isn't provided for the PATTERN_ANALYZER text analyzer, the regular expression \b\w{2,}\b is used by default to match non-Unicode words that have at least two characters.
Example
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '{"patterns": ["[a-zA-Z]*"]}'
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['like', 'pie', 'you', 'like', 'pie', 'they', 'like', 'pies' ]   |
 +------------------------------------------------------------------*/

token_filters analyzer option
'{ "token_filters":array_of_token_filters}'Description
If you are using the LOG_ANALYZER or PATTERN_ANALYZER text analyzer, you can sequentially apply one or more token filters to the input text after the input text has been tokenized.
Definitions
array_of_token_filters: A JSON array containing objects that represent token filters.
Details
For more information about the specific token filters you can add, see Token filters.
Example
For example, this query contains both the patterns and token_filters options:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '''
  {
    "patterns": ["[a-zA-Z]*"],
    "token_filters": [
      { "normalizer": { "mode": "LOWER" } },
      { "stop_words": ["they", "pie"] }
    ]
  }'''
) AS results

/*------------------------------------------------+
 | results                                         |
 +-------------------------------------------------+
 | ['i', 'like', 'you', 'like', 'like', 'pies' ]   |
 +-------------------------------------------------*/

Token filters
'{ "token_filters": [ { "normalizer": { "mode":json_string, "icu_normalize_mode":json_string, "icu_case_folding":json_boolean } }, { "stop_words":json_string_array } ]}'Token filters can modify or delete terms (tokens) that are extracted frominput text. If no token filters are specified for a text analyzer that supportstoken filters, theASCII lowercase normalizationtoken filter is applied by default. If multiple token filters are added, theyare applied in the order in which they are specified. The same token filter canbe included multiple times in thetoken_filters array. See the examples inthis section for details.
Definitions
Each token filter has a unique JSON syntax that contains some of these JSON key-value pairs, depending on the type of token filter you want to use:
- token_filters: JSON array of objects that contain token filters. The same type of token filter can be included multiple times in this array.
- stop_words: JSON array of strings that represent the words to remove from the list of terms.
- normalizer: JSON object that contains the normalization settings for a token filter. The settings include:
  - mode: JSON string that represents the normalization mode. Your choices are:
    - NONE: Don't apply normalization to terms.
    - LOWER: ASCII lowercase terms. If no token filters are specified for a text analyzer that supports token filters, this mode is used by default.
    - UNICODE_LOWER: Unicode lowercase terms. Mapping between lowercase and uppercase is done according to the Unicode Character Database without taking into account language-specific mappings.
    - ICU_NORMALIZE: ICU normalize terms.
  - icu_normalize_mode: JSON string that represents the ICU normalization mode. Your choices are:
    - NFC: Apply ICU NFC normalization to terms.
    - NFKC: Apply ICU NFKC normalization to terms.
    - NFD: Apply ICU NFD normalization to terms.
    - NFKD: Apply ICU NFKD normalization to terms.

    You can use this if mode is ICU_NORMALIZE. If mode is ICU_NORMALIZE and this key-value pair isn't set, icu_normalize_mode is NFKC by default.
  - icu_case_folding: JSON boolean that determines whether to apply ICU case folding to terms. Use true to apply ICU case folding to terms; otherwise, use false. You can use this if mode is ICU_NORMALIZE. If mode is ICU_NORMALIZE and this value isn't set, icu_case_folding is true by default.
Details
Token filters can be used with every text analyzer except NO_OP_ANALYZER. Token filters are applied after the text analyzer breaks the input text into terms.

If token_filters isn't specified for an analyzer that supports token filters, ASCII lowercase normalization is applied by default.

You can add multiple token filters to the token filters array (token_filters). If multiple token filters are added, they are applied to the terms in the order in which they are specified. For more information, see the examples in this section.

You can add the same token filter multiple times to the token filters array. For more information, see the examples in this section.
Here are some of the filters that you can apply to terms, using the token filter JSON syntax:
- No normalization
- Convert to lowercase (ASCII)
- Convert to lowercase (UNICODE_LOWER)
- Convert to lowercase (ICU case folding)
- Preserve uppercase
- ICU normalize with NFC
- ICU normalize with NFKC
- ICU normalize with NFD
- ICU normalize with NFKD
- Remove words
Examples
In the following example, the terms are NFKC normalized, and then, because ICU case folding is true, they are converted to lowercase. Finally, the lowercase words pies and 2 are removed from the results.
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      { "stop_words": ["pies", "2"] }
    ]
  }'''
) AS results

/*--------------------------------------------+
 | results                                     |
 +---------------------------------------------+
 | ['i', 'like', '❶', 'you', 'like', 'ño' ]    |
 +---------------------------------------------*/

The following query is similar to the preceding one, but the token filters are reordered, and this affects the outcome of the query. In the results, 2 and pies are retained because ② is normalized to 2 and PIEs is normalized to pies after the stop words token filter is applied:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      { "stop_words": ["pies", "2"] },
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      }
    ]
  }'''
) AS results

/*---------------------------------------------------------+
 | results                                                  |
 +----------------------------------------------------------+
 | ['i', 'like', '❶', '2', 'you', 'like', 'ño', 'pies' ]    |
 +----------------------------------------------------------*/

You can use the same token filter as many times as you'd like in a query. In the following query, stop_words is used twice:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      { "stop_words": ["like", "you"] },
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      { "stop_words": ["ño"] }
    ]
  }'''
) AS results

/*------------------------------------+
 | results                             |
 +-------------------------------------+
 | ['i', '❶', '2', 'pies', 'pies' ]    |
 +-------------------------------------*/

No normalization
'{ "token_filters": [ "normalizer": { "mode": "NONE" } ]}'Description
Normalization isn't applied to terms.
Example
In the following query, normalization isn't applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      { "normalizer": { "mode": "NONE" } }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['I', 'like', '❶', '②', 'pies', 'you', 'like', 'Ño', 'PIEs' ]    |
 +------------------------------------------------------------------*/

Convert to lowercase (ASCII)
'{ "token_filters": [ "normalizer": { "mode": "LOWER" } ]}'Description
Performs ASCII lowercasing on the resulting terms.
Example
In the following query, ASCII lowercasing is applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      { "normalizer": { "mode": "LOWER" } }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '②', 'pies', 'you', 'like', 'Ño', 'pies' ]    |
 +------------------------------------------------------------------*/

Convert to lowercase (Unicode)
'{ "token_filters": [ "normalizer": { "mode": "UNICODE_LOWER" } ]}'Description
Performs Unicode lowercasing on the resulting terms. Mapping between lowercase and uppercase is done according to the Unicode Character Database without taking into account language-specific mappings.
Example
In the following query, Unicode lowercasing is applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      { "normalizer": { "mode": "UNICODE_LOWER" } }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '②', 'pies', 'you', 'like', 'ño', 'pies' ]    |
 +------------------------------------------------------------------*/

Convert to lowercase (ICU case folding)
'{ "token_filters": [ "normalizer": { "mode": "ICU_NORMALIZE", "icu_case_folding": true } ]}'Description
Performs ICU case folding, which converts the resulting terms to lowercase.
Example
In the following query, ICU case folding is applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_case_folding": true
        }
      }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '2', 'pies', 'you', 'like', 'ño', 'pies' ]    |
 +------------------------------------------------------------------*/

Preserve uppercase
'{ "token_filters": [ "normalizer": { "mode": "ICU_NORMALIZE", "icu_case_folding": false } ]}'Description
Don't convert uppercase characters to lowercase characters in the resultingterms.
Example
In the following query, ICU case folding isn't applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_case_folding": false
        }
      }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['I', 'like', '❶', '2', 'pies', 'you', 'like', 'Ño', 'PIEs' ]    |
 +------------------------------------------------------------------*/

ICU normalize with NFC
'{ "token_filters": [ "normalizer": { "mode": "ICU_NORMALIZE", "icu_normalize_mode": "NFC" } ]}'Description
Normalizes text with ICU NFC normalization, which decomposes and recomposes characters by canonical equivalence.
Example
In the following query, NFC normalization is applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFC"
        }
      }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '②', 'pies', 'you', 'like', 'ño', 'pies' ]    |
 +------------------------------------------------------------------*/

ICU normalize with NFKC
'{ "token_filters": [ "normalizer": { "mode": "ICU_NORMALIZE", "icu_normalize_mode": "NFKC" } ]}'Description
Normalizes text with ICU NFKC normalization, which decomposes characters by compatibility, and then recomposes the characters by canonical equivalence.
Example
In the following query, NFKC normalization is applied to the results:
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC"
        }
      }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '2', 'pies', 'you', 'like', 'ño', 'pies' ]    |
 +------------------------------------------------------------------*/

ICU normalize with NFD
'{ "token_filters": [ "normalizer": { "mode": "ICU_NORMALIZE", "icu_normalize_mode": "NFD" } ]}'Description
Normalizes text with ICU NFD normalization, which decomposes characters by canonical equivalence, and then arranges multiple combining characters in a specific order.
Example
In the following query, although the input and output for ñ look the same, the bytes are different (the input is \u00f1, the output is \u006e \u0303).
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFD"
        }
      }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '2', 'pies', 'you', 'like', 'ño', 'pies' ]    |
 +------------------------------------------------------------------*/

ICU normalize with NFKD
'{ "token_filters": [ "normalizer": { "mode": "ICU_NORMALIZE", "icu_normalize_mode": "NFKD" } ]}'Description
Normalizes text with ICU NFKD normalization, which decomposes characters by compatibility, and then arranges multiple combining characters in a specific order.
Example
In the following query, although the input and output for ñ look the same, the bytes are different (the input is \u00f1, the output is \u006e \u0303).
SELECT TEXT_ANALYZE(
  'I like ❶ ② pies, you like Ño PIEs',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKD"
        }
      }
    ]
  }'''
) AS results

/*-----------------------------------------------------------------+
 | results                                                          |
 +------------------------------------------------------------------+
 | ['i', 'like', '❶', '2', 'pies', 'you', 'like', 'ño', 'pies' ]    |
 +------------------------------------------------------------------*/

Remove words
'{ "token_filters": [ "stop_words":array_of_stop_words ]}'Description
Exclude a list of terms (tokens) from the results.
Definitions
array_of_stop_words: A JSON array containing strings that represent terms. These terms shouldn't be included in the results. The array must have at least one element. An empty string is a valid array element.
Example
In the following query, the words they and pie are excluded from the results:
SELECT TEXT_ANALYZE(
  'I like pie, you like-pie, they like 2 PIEs.',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''
  {
    "token_filters": [
      { "stop_words": ["they", "pie"] }
    ]
  }'''
) AS results

/*-----------------------------------------------------+
 | results                                              |
 +------------------------------------------------------+
 | ['I', 'like', 'you', 'like', 'like', '2', 'PIEs' ]   |
 +------------------------------------------------------*/