Work with text analyzers

The CREATE SEARCH INDEX DDL statement, SEARCH function, and TEXT_ANALYZE function support advanced text analyzer configuration options. Understanding BigQuery's text analyzers and their options lets you refine your search experience.

This document provides an overview of the different text analyzers available in BigQuery and their configuration options, as well as examples of how text analyzers work with search in BigQuery. For more information about text analyzer syntax, see Text analysis.

Text analyzers

BigQuery supports the following text analyzers:

  • NO_OP_ANALYZER
  • LOG_ANALYZER
  • PATTERN_ANALYZER

NO_OP_ANALYZER

Use the NO_OP_ANALYZER when you have pre-processed data that you want to match exactly. No tokenization or normalization is applied to the text. Because this analyzer performs no tokenization or normalization, it accepts no configuration. For more information about NO_OP_ANALYZER, see NO_OP_ANALYZER.
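
For example, the following sketch creates a search index that stores each value unmodified. The index name my_noop_index is hypothetical, and the sketch assumes the dataset.data_table table used in the examples later in this document:

```sql
-- A minimal sketch, assuming dataset.data_table(text_data STRING) exists.
-- With NO_OP_ANALYZER, each text_data value is indexed exactly as stored,
-- so SEARCH only returns TRUE on an exact match of the whole value.
CREATE SEARCH INDEX my_noop_index
ON dataset.data_table(text_data)
OPTIONS (analyzer = 'NO_OP_ANALYZER');
```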

LOG_ANALYZER

The LOG_ANALYZER modifies data in the following ways:

  • Text is made lowercase.
  • ASCII values greater than 127 are kept as is.

  • Text is split into individual terms called tokens by the following delimiters:

    [ ] < > ( ) { } | ! ; , ' " * & ? + / : = @ . - $ % \ _ \n \r \s \t %21 %26 %2526 %3B %3b %7C %7c %20 %2B %2b %3D %3d %2520 %5D %5d %5B %5b %3A %3a %0A %0a %2C %2c %28 %29

    If you don't want to use the default delimiters, you can specify the delimiters you want to use as text analyzer options. LOG_ANALYZER lets you configure specific delimiters and token filters for more control over your search results. For more information about the specific configuration options available when using the LOG_ANALYZER, see the delimiters analyzer option and token_filters analyzer option.
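
For example, the following sketch (with a hypothetical index name, assuming the dataset.data_table table used later in this document) replaces the default delimiters so that only hyphens and periods split tokens:

```sql
-- A minimal sketch with custom delimiters: a value such as "a/b-c.d"
-- tokenizes to ["a/b", "c", "d"], because "/" is no longer a delimiter
-- once the default list is overridden.
CREATE SEARCH INDEX my_log_index
ON dataset.data_table(text_data)
OPTIONS (
  analyzer = 'LOG_ANALYZER',
  analyzer_options = '''{"delimiters": ["-", "."]}''');
```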

PATTERN_ANALYZER

The PATTERN_ANALYZER text analyzer extracts tokens from text using a regular expression. The regular expression engine and syntax used with PATTERN_ANALYZER is RE2. PATTERN_ANALYZER extracts tokens in the following order:

  1. It finds the first substring that matches the pattern (from the left) in the string. This is a token to be included in the output.
  2. It removes everything from the input string until the end of the substring found in step 1.
  3. It repeats the process until the string is empty or the pattern no longer matches.
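
You can observe this extraction loop directly with the TEXT_ANALYZE function. The following sketch applies the [a-z]{2} pattern from the table below to the input abacad:

```sql
-- Returns ["ab", "ac", "ad"]: each leftmost match is emitted as a token
-- and the scan continues after the end of that match.
SELECT TEXT_ANALYZE(
  'abacad',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '''{"patterns": ["[a-z]{2}"]}''');
```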

The following table provides examples of PATTERN_ANALYZER token extraction:

| Pattern | Input text | Output tokens |
|---|---|---|
| `ab` | `ababab` | `ab` |
| `ab` | `abacad` | `ab` |
| `[a-z]{2}` | `abacad` | `ab`, `ac`, `ad` |
| `aaa` | `aaaaa` | `aaa` |
| `[a-z]/` | `a/b/c/d/e` | `a/`, `b/`, `c/`, `d/` |
| `/[^/]+/` | `aa/bb/cc` | `/bb/` |
| `[0-9]+` | `abc` | (no tokens) |
| `(?:/?)[a-z]` | `/abc` | `/abc` |
| `(?:/)[a-z]` | `/abc` | `/abc` |
| `(?:[0-9]abc){3}(?:[a-z]000){2}` | `7abc7abc7abcx000y000` | `7abc7abc7abcx000y000` |
| `".+"` | `"cats" and "dogs"` | `"cats" and "dogs"` — the greedy quantifier `+` makes the regular expression match the longest possible string, so `"cats" and "dogs"` is extracted as a single token. |
| `".+?"` | `"cats" and "dogs"` | `"cats"`, `"dogs"` — the lazy quantifier `+?` makes the regular expression match the shortest possible string, so `"cats"` and `"dogs"` are extracted as two separate tokens. |

Using the PATTERN_ANALYZER text analyzer gives you more control over the tokens extracted from text when used with the SEARCH function. The following table shows how different patterns produce different SEARCH results:

| Pattern | Query | Text | Tokens from text | SEARCH(text, query) | Explanation |
|---|---|---|---|---|---|
| `abc` | `abcdef` | `abcghi` | `abcghi` | TRUE | 'abc' in ['abcghi'] |
| `cd[a-z]` | `abcdef` | `abcghi` | `abcghi` | FALSE | 'cde' in ['abcghi'] |
| `[a-z]/` | `a/b/` | `a/b/c/d/` | `a/`, `b/`, `c/`, `d/` | TRUE | 'a/' in ['a/', 'b/', 'c/', 'd/'] AND 'b/' in ['a/', 'b/', 'c/', 'd/'] |
| `/[^/]+/` | `aa/bb/` | `aa/bb/cc/` | `/bb/` | TRUE | '/bb/' in ['/bb/'] |
| `/[^/]+/` | `bb` | `aa/bb/cc/` | `/bb/` | ERROR | No match found in query term |
| `[0-9]+` | `abc` | `abc123` | | ERROR | No match found in query term |
| `[0-9]+` | `` `abc` `` | `abc123` | | ERROR | No match found in query term. The backtick is matched as a literal backtick, not as a special character. |
| `[a-z][a-z0-9]*@google\.com` | `This is my email: test@google.com` | `test@google.com` | `test@google.com` | TRUE | 'test@google.com' in ['test@google.com'] |
| `abc` | `abc\ abc` | `abc` | `abc` | TRUE | 'abc' in ['abc']. Note that 'abc abc' is a single query term after being parsed by the search query parser, because the space is escaped. |
| `(?i)(?:Abc)` (no normalization) | `aBcd` | `Abc` | `Abc` | FALSE | 'aBc' in ['Abc'] |
| `(?i)(?:Abc)` with normalization: lower_case = true | `aBcd` | `Abc` | `abc` | TRUE | 'abc' in ['abc'] |
| `(?:/?)abc` | `bc/abc` | `/abc/abc/` | `/abc` | TRUE | '/abc' in ['/abc'] |
| `(?:/?)abc` | `abc` | `d/abc` | `/abc` | FALSE | 'abc' in ['/abc'] |
| `".+"` | `"cats"` | `"cats" and "dogs"` | `"cats" and "dogs"` | FALSE | '"cats"' in ['"cats" and "dogs"']. The greedy quantifier `+` matches the longest possible string, so the whole phrase is a single token. |
| `".+?"` | `"cats"` | `"cats" and "dogs"` | `"cats"`, `"dogs"` | TRUE | '"cats"' in ['"cats"', '"dogs"']. The lazy quantifier `+?` matches the shortest possible string, so `"cats"` and `"dogs"` are two separate tokens. |
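
To compare the two quantifiers outside of a search index, you can call TEXT_ANALYZE directly. The following sketch uses made-up input strings rather than the quoted examples above, to avoid the extra JSON escaping that quote characters require:

```sql
-- Greedy: .+ matches as much as possible, so one long token is extracted.
-- Returns ["axxbyyb"].
SELECT TEXT_ANALYZE('axxbyyb',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '''{"patterns": ["a.+b"]}''');

-- Lazy: .+? matches as little as possible, so only "axxb" is extracted;
-- the remainder "yyb" contains no further match. Returns ["axxb"].
SELECT TEXT_ANALYZE('axxbyyb',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '''{"patterns": ["a.+?b"]}''');
```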

Examples

The following examples demonstrate the use of text analysis with customization options to create search indexes, extract tokens, and return search results.

LOG_ANALYZER with NFKC ICU normalization and stop words

The following example configures LOG_ANALYZER options with NFKC ICU normalization and stop words. The example assumes the following data table with data already populated:

CREATE TABLE dataset.data_table(text_data STRING);

To create a search index with NFKC ICU normalization and a list of stop words, create a JSON-formatted string in the analyzer_options option of the CREATE SEARCH INDEX DDL statement. For a complete list of options available when creating a search index with the LOG_ANALYZER, see LOG_ANALYZER. For this example, our stop words are "the", "of", "and", "for".

CREATE OR REPLACE SEARCH INDEX `my_index`
ON `dataset.data_table`(ALL COLUMNS)
OPTIONS(
  analyzer = 'LOG_ANALYZER',
  analyzer_options = '''{
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      { "stop_words": ["the", "of", "and", "for"] }
    ]
  }''');

Given the previous example, the following table describes the token extraction for various values of text_data. Note that in this document the double question mark character (⁇) is a single Unicode character, distinct from two consecutive ASCII question marks (??):

| Data text | Tokens for index | Explanation |
|---|---|---|
| The Quick Brown Fox | ["quick", "brown", "fox"] | LOG_ANALYZER tokenization produces the tokens ["The", "Quick", "Brown", "Fox"]. Next, ICU normalization with icu_case_folding = true lowercases the tokens to produce ["the", "quick", "brown", "fox"]. Finally, the stop words filter removes "the" from the list. |
| The Ⓠuick Ⓑrown Ⓕox | ["quick", "brown", "fox"] | LOG_ANALYZER tokenization produces the tokens ["The", "Ⓠuick", "Ⓑrown", "Ⓕox"]. Next, NFKC ICU normalization with icu_case_folding = true lowercases the tokens to produce ["the", "quick", "brown", "fox"]. Finally, the stop words filter removes "the" from the list. |
| Ⓠuick⁇Ⓕox | ["quick??fox"] | LOG_ANALYZER tokenization produces the token ["Ⓠuick⁇Ⓕox"]. Next, NFKC ICU normalization with icu_case_folding = true lowercases it to produce ["quick??fox"]. Notice that the double question mark character has been normalized into two ASCII question marks. Finally, the stop words filter does nothing because none of the tokens are in the filter list. |
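
You can reproduce the "Tokens for index" column by calling the TEXT_ANALYZE function with the same analyzer_options as the index; a sketch for the last row:

```sql
-- Returns ["quick??fox"]: NFKC normalization folds the enclosed letters and
-- converts the double question mark character into two ASCII question marks;
-- no stop word applies.
SELECT TEXT_ANALYZE(
  'Ⓠuick⁇Ⓕox',
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''{
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      { "stop_words": ["the", "of", "and", "for"] }
    ]
  }''');
```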

Now that the search index has been created, you can use the SEARCH function to search the table using the same analyzer configurations specified in the search index. Note that if the analyzer configurations in the SEARCH function don't match those of the search index, the search index won't be used. Use the following query:

SELECT SEARCH(
  dataset.data_table.text_data,
  "search_query",
  analyzer => 'LOG_ANALYZER',
  analyzer_options => '''{
    "token_filters": [
      {
        "normalizer": {
          "mode": "ICU_NORMALIZE",
          "icu_normalize_mode": "NFKC",
          "icu_case_folding": true
        }
      },
      {
        "stop_words": ["the", "of", "and", "for"]
      }
    ]
  }''')
FROM dataset.data_table;

Replace the following:

  • search_query: The text you want to search for.

The following table demonstrates various results based on different search text and different values of search_query:

| text_data | search_query | Result | Explanation |
|---|---|---|---|
| The Quick Brown Fox | "Ⓠuick" | TRUE | The final list of tokens extracted from the text is ["quick", "brown", "fox"]. The final list of tokens extracted from the query is ["quick"]. The query tokens can all be found in the text tokens. |
| The Ⓠuick Ⓑrown Ⓕox | "quick" | TRUE | The final list of tokens extracted from the text is ["quick", "brown", "fox"]. The final list of tokens extracted from the query is ["quick"]. The query tokens can all be found in the text tokens. |
| Ⓠuick⁇Ⓕox | "quick" | FALSE | The final list of tokens extracted from the text is ["quick??fox"]. The final list of tokens extracted from the query is ["quick"]. "quick" is not in the list of tokens from the text. |
| Ⓠuick⁇Ⓕox | "quick⁇fox" | TRUE | The final list of tokens extracted from the text is ["quick??fox"]. The final list of tokens extracted from the query is ["quick??fox"]. "quick??fox" is in the list of tokens from the text. |
| Ⓠuick⁇Ⓕox | "`quick⁇fox`" | FALSE | In LOG_ANALYZER, a backtick requires an exact text match. |

PATTERN_ANALYZER for IPv4 search with stop words

The following example configures the PATTERN_ANALYZER text analyzer to search for a specific pattern while filtering certain stop words. In this example, the pattern matches an IPv4 address and ignores the localhost value (127.0.0.1).

This example assumes that the following table is populated with data:

CREATE TABLE dataset.data_table(text_data STRING);

To create a search index with the patterns option and a list of stop words, create a JSON-formatted string in the analyzer_options option of the CREATE SEARCH INDEX DDL statement. For a complete list of options available when creating a search index with the PATTERN_ANALYZER, see PATTERN_ANALYZER. For this example, our stop word is the localhost address, 127.0.0.1.

CREATE SEARCH INDEX my_index
ON dataset.data_table(text_data)
OPTIONS(
  analyzer = 'PATTERN_ANALYZER',
  analyzer_options = '''{
    "patterns": [
      "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
    ],
    "token_filters": [
      {
        "stop_words": [
          "127.0.0.1"
        ]
      }
    ]
  }''');

When using regular expressions with analyzer_options, include three leading \ symbols to properly escape regular expressions that include a \ symbol, such as \d or \b.
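
For example, to extract runs of digits with \d+, write the pattern with three leading backslashes so that it survives both the SQL string and the JSON string unescaping. The index name my_digits_index is hypothetical:

```sql
-- A minimal sketch of the escaping rule above: "\\\d+" in the SQL literal
-- reaches the RE2 engine as \d+.
CREATE SEARCH INDEX my_digits_index
ON dataset.data_table(text_data)
OPTIONS (
  analyzer = 'PATTERN_ANALYZER',
  analyzer_options = '''{"patterns": ["\\\d+"]}''');
```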

The following table describes the tokenization for various values of text_data:

| Data text | Tokens for index | Explanation |
|---|---|---|
| abc192.168.1.1def 172.217.20.142 | ["192.168.1.1", "172.217.20.142"] | The IPv4 pattern captures the IPv4 addresses even if there is no space between the address and the surrounding text. |
| 104.24.12.10abc 127.0.0.1 | ["104.24.12.10"] | "127.0.0.1" is filtered out because it's in the list of stop words. |
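
The same tokenization can be checked without the index by passing identical analyzer_options to the TEXT_ANALYZE function; a sketch for the first row:

```sql
-- Returns ["192.168.1.1", "172.217.20.142"]; a "127.0.0.1" occurrence
-- would be removed by the stop_words token filter.
SELECT TEXT_ANALYZE(
  'abc192.168.1.1def 172.217.20.142',
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '''{
    "patterns": [
      "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
    ],
    "token_filters": [
      { "stop_words": ["127.0.0.1"] }
    ]
  }''');
```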

Now that the search index has been created, you can use the SEARCH function to search the table based on the tokenization specified in analyzer_options. Use the following query:

SELECT SEARCH(
  dataset.data_table.text_data,
  "search_query",
  analyzer => 'PATTERN_ANALYZER',
  analyzer_options => '''{
    "patterns": [
      "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
    ],
    "token_filters": [
      {
        "stop_words": [
          "127.0.0.1"
        ]
      }
    ]
  }''')
FROM dataset.data_table;

Replace the following:

  • search_query: The text you want to search for.

The following table demonstrates various results based on different search text and different values of search_query:

| text_data | search_query | Result | Explanation |
|---|---|---|---|
| 128.0.0.2 | "127.0.0.1" | ERROR | No search token in the query. The query goes through the text analyzer, which filters out the "127.0.0.1" token. |
| abc192.168.1.1def 172.217.20.142 | "192.168.1.1abc" | TRUE | The list of tokens extracted from the query is ["192.168.1.1"]. The list of tokens extracted from the text is ["192.168.1.1", "172.217.20.142"]. |
| abc192.168.1.1def 172.217.20.142 | "`192.168.1.1`" | TRUE | The list of tokens extracted from the query is ["192.168.1.1"]. The list of tokens extracted from the text is ["192.168.1.1", "172.217.20.142"]. Note that backticks are treated as regular characters for PATTERN_ANALYZER. |
