Tokenization
Note: This feature is available with the Spanner Enterprise edition and Enterprise Plus edition. For more information, see the Spanner editions overview.
This page describes how to add tokenization to tables. Tokenization is necessary to create the tokens that are used in the search index.
Tokenization is the process of transforming values into tokens. The method you use to tokenize a document determines the types and efficiency of the searches that users can perform on it.
Spanner provides tokenizers for natural language text, substrings, verbatim text, numbers, and booleans. The database schema uses the tokenizer that matches the type of search needed for the column. Tokenizers have the following characteristics:
- Each tokenizer is a SQL function that gets an input, such as a string or a number, and named arguments for additional options.
- The tokenizer outputs a TOKENLIST.
For example, a text string The quick brown fox jumps over the lazy dog is tokenized into [the, quick, brown, fox, jumps, over, the, lazy, dog]. An HTML string The <b>apple</b> is <i>red</i> is tokenized into [the, apple, is, red].
Tokens have the following characteristics:
- Tokens are stored in columns that use the TOKENLIST data type.
- Each token is stored as a sequence of bytes, with an optional set of associated attributes. For example, in full-text applications, a token is typically a single word from a textual document.
- When tokenizing HTML values, Spanner generates attributes that indicate the prominence of a token within the document. Spanner uses these attributes for scoring to boost more prominent terms (such as a heading).
Tokenizers
Spanner supports the following tokenizer functions:
Full-text tokenizer (TOKENIZE_FULLTEXT) produces whole-word tokens for natural language queries.

Example

Both of the following functions:

GoogleSQL

```sql
TOKENIZE_FULLTEXT("Yellow apple")
TOKENIZE_FULLTEXT("Yellow <b>apple</b>", content_type=>"text/html")
```

PostgreSQL

This example uses spanner.tokenize_fulltext.

```sql
spanner.tokenize_fulltext('Yellow apple')
spanner.tokenize_fulltext('Yellow <b>apple</b>', content_type=>'text/html')
```

produce the same tokens: [yellow, apple].
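To check how a value is tokenized, you can wrap the call in Spanner's DEBUG_TOKENLIST function, which renders a TOKENLIST in human-readable form. A minimal GoogleSQL sketch (the input string is illustrative):

```sql
-- Render the TOKENLIST produced by the tokenizer as a readable string,
-- which is useful when verifying how a value is tokenized.
SELECT DEBUG_TOKENLIST(TOKENIZE_FULLTEXT("Yellow apple")) AS tokens;
```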
Substring tokenizer (TOKENIZE_SUBSTRING) generates tokens for each n-gram of each word. It's used to find substrings of words in a text.

Example

GoogleSQL

```sql
TOKENIZE_SUBSTRING('hello world', ngram_size_min=>4, ngram_size_max=>6)
```

PostgreSQL

This example uses spanner.tokenize_substring.

```sql
spanner.tokenize_substring('hello world', ngram_size_min=>4, ngram_size_max=>6)
```

Produces the following tokens: [ello, hell, hello, orld, worl, world].
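Substring tokens are what the SEARCH_SUBSTRING function matches against. The following is a hedged sketch, assuming a hypothetical Songs table and index (the names are illustrative):

```sql
-- Hypothetical schema: substring tokens enable partial-word matching.
CREATE TABLE Songs (
  SongId STRING(MAX) NOT NULL,
  Title STRING(MAX),
  Title_Subtokens TOKENLIST AS (TOKENIZE_SUBSTRING(Title)) HIDDEN
) PRIMARY KEY (SongId);

CREATE SEARCH INDEX SongsSubstringIndex ON Songs(Title_Subtokens);

-- Matches rows whose Title contains the substring "ello", such as "hello world".
SELECT SongId FROM Songs WHERE SEARCH_SUBSTRING(Title_Subtokens, 'ello');
```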
N-gram tokenizer (TOKENIZE_NGRAMS) generates n-grams from an input (without splitting it into separate words). It's used to accelerate regular expression predicates.

Example

The following function:

GoogleSQL

```sql
TOKENIZE_NGRAMS("Big Time", ngram_size_min=>4, ngram_size_max=>4)
```

PostgreSQL

This example uses spanner.tokenize_ngrams.

```sql
spanner.tokenize_ngrams('big time', ngram_size_min=>4, ngram_size_max=>4)
```

Produces the following tokens: ["Big ", "ig T", "g Ti", " Tim", "Time"].
Exact match tokenizers (TOKEN and TOKENIZE_BOOL) are used to look up rows containing a certain value in one of their columns. For example, an application that indexes a products catalog might want to search for products of a particular brand and color.

Examples

The following functions:

GoogleSQL

```sql
TOKEN("hello")
TOKEN(["hello", "world"])
```

PostgreSQL

This example uses spanner.token.

```sql
spanner.token('hello')
```

Produce the following tokens: [hello].

The following function:

GoogleSQL

```sql
TOKENIZE_BOOL(true)
```

PostgreSQL

This example uses spanner.tokenize_bool.

```sql
spanner.tokenize_bool(true)
```

Produces the following token: [y].
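As a concrete version of the products-catalog example above, here is a hedged GoogleSQL sketch (the table and column names are hypothetical):

```sql
-- TOKEN produces exact-match tokens, so a search index over these columns
-- supports lookups by exact brand or color value.
CREATE TABLE Products (
  ProductId STRING(MAX) NOT NULL,
  Brand STRING(MAX),
  Color STRING(MAX),
  Brand_Tokens TOKENLIST AS (TOKEN(Brand)) HIDDEN,
  Color_Tokens TOKENLIST AS (TOKEN(Color)) HIDDEN
) PRIMARY KEY (ProductId);
```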
Number tokenizers (TOKENIZE_NUMBER) are used to generate a set of tokens that accelerate numeric comparison searches. For equality conditions, the token is the number itself. For range conditions (like rating >= 3.5), the set of tokens is more elaborate.

Examples

The following function statements:

GoogleSQL

```sql
TOKENIZE_NUMBER(42, comparison_type=>'equality')
TOKENIZE_NUMBER(42, comparison_type=>'all', granularity=>10, min=>1, max=>100)
```

PostgreSQL

This example uses spanner.tokenize_number.

```sql
spanner.tokenize_number(42, comparison_type=>'equality')
spanner.tokenize_number(42, comparison_type=>'all', granularity=>10, min=>1, max=>100)
```

Produce the following tokens, respectively: "==42", and "==42", "[1,75]", "[36,45]", "[36,55]", "[36,75]".

JSON and JSONB tokenizers (TOKENIZE_JSON and TOKENIZE_JSONB) are used to generate a set of tokens that accelerate JSON containment and key existence predicates, such as doc[@key] IS NOT NULL (GoogleSQL) or doc ? 'key' (PostgreSQL).
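This page doesn't show a full example for these tokenizers. As a hedged sketch, a JSON column might be wired up the same way as the other tokenizers, in a generated TOKENLIST column (the table and column names are hypothetical, and only the bare call is shown since the argument list isn't documented here):

```sql
-- Hypothetical GoogleSQL schema: tokenize a JSON document so that
-- containment and key-existence predicates can use a search index.
CREATE TABLE Docs (
  DocId STRING(MAX) NOT NULL,
  Doc JSON,
  Doc_Tokens TOKENLIST AS (TOKENIZE_JSON(Doc)) HIDDEN
) PRIMARY KEY (DocId);
```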
Tokenization functions are usually used in a generated column expression. These columns are defined as HIDDEN so that they aren't included in SELECT * query results.
The following example uses a full-text tokenizer and a numeric tokenizer to create a database that stores music album names and ratings. The DDL statement does two things:
- Defines the data columns AlbumTitle and Rating.
- Defines AlbumTitle_Tokens and Rating_Tokens. These TOKENLIST columns tokenize the values in the data columns so that Spanner can index them.

GoogleSQL
```sql
CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  AlbumTitle STRING(MAX),
  Rating FLOAT64,
  AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,
  Rating_Tokens TOKENLIST AS (TOKENIZE_NUMBER(Rating)) HIDDEN
) PRIMARY KEY (AlbumId);
```

PostgreSQL
```sql
CREATE TABLE albums (
  albumid character varying NOT NULL,
  albumtitle character varying,
  albumtitle_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle)) VIRTUAL HIDDEN,
  PRIMARY KEY (albumid)
);
```
Whenever the base values are modified, AlbumTitle_Tokens and Rating_Tokens are automatically updated.
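For example, a plain insert is all that's needed; there is no separate step to populate the token columns (a minimal GoogleSQL sketch with illustrative values):

```sql
-- The HIDDEN TOKENLIST columns are generated from AlbumTitle and Rating
-- automatically when the row is written.
INSERT INTO Albums (AlbumId, AlbumTitle, Rating)
VALUES ('a1', 'The Quick Brown Fox', 4.5);
```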
Tokenize plain text or HTML content
Text tokenization supports plain text and HTML content types. Use the Spanner TOKENIZE_FULLTEXT function to create tokens. Then use the CREATE SEARCH INDEX DDL statement to generate the search index.
For example, the following CREATE TABLE DDL statement uses the TOKENIZE_FULLTEXT function to create tokens from AlbumTitle in the Albums table. The CREATE SEARCH INDEX DDL statement then creates a search index on the new AlbumTitle_Tokens column.
GoogleSQL
```sql
CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  AlbumTitle STRING(MAX),
  AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN
) PRIMARY KEY (AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(AlbumTitle_Tokens);
```

PostgreSQL
```sql
CREATE TABLE albums (
  albumid character varying NOT NULL,
  albumtitle character varying,
  albumtitle_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle)) VIRTUAL HIDDEN,
  PRIMARY KEY (albumid)
);

CREATE SEARCH INDEX albumsindex ON albums(albumtitle_tokens);
```

The tokenization process uses the following rules:
- Tokenization doesn't include stemming or correction of misspelled words. For example, in a sentence like "A cat was looking at a group of cats", the token "cat" is indexed separately from the token "cats". Compared to other search engines that normalize tokens during writes, Spanner provides an option to expand the search query to include different forms of words. For more information, see Enhanced query mode.
- Stopwords (like "a") are included in the search index.
- Full-text search is always case insensitive. The tokenization process converts all tokens to lowercase.
The tokenization process tracks the positions for each token in the original text. These positions are later used to match phrases. The positions are stored in the search index alongside docids.
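For example, because positions are indexed, a search query can require tokens to appear adjacently and in order. A hedged sketch against the Albums search index defined earlier (the double-quoted phrase follows Spanner's search query syntax):

```sql
-- Matches titles containing the exact phrase "brown fox",
-- not just both words anywhere in the title.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, '"brown fox"');
```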
Google continues to improve tokenization algorithms. In some cases, this might lead to a string getting tokenized differently in the future from the way it is tokenized now. We expect such cases to be extremely rare. One example is an improvement in Chinese, Japanese, and Korean (CJK) language segmentation.
The content_type argument specifies whether the content format uses plain text or HTML. Use the following settings to set the content_type:
- For text tokenization, set the content_type argument to "text/plain". This is the default setting.
- For HTML tokenization, set the content_type argument to "text/html". Without this argument, HTML tags are treated as punctuation. In HTML mode, Spanner uses heuristics to infer how prominent the text is on the page, for example, whether the text is in a heading or what its font size is. The supported attributes for HTML include small, medium, large, title, and link. Like position, the attribute is stored alongside the token in the search index. Tokenization doesn't create tokens for any HTML tags.
Token attributes don't impact matching or the results of the SEARCH or SEARCH_SUBSTRING function. They're only used for ranking.
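A hedged sketch of that separation, assuming the Albums schema above: SEARCH decides which rows match, while SCORE (Spanner's relevance-scoring function) can use token attributes to rank them:

```sql
-- Attributes such as HTML prominence affect the SCORE ordering,
-- but never which rows SEARCH returns.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'fox')
ORDER BY SCORE(AlbumTitle_Tokens, 'fox') DESC;
```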
The following example shows how to tokenize text:
GoogleSQL
```sql
CREATE TABLE T (
  ...
  Text STRING(MAX),
  Html STRING(MAX),
  Text_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Text, content_type=>"text/plain")) HIDDEN,
  Html_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(Html, content_type=>"text/html")) HIDDEN
) PRIMARY KEY (...);
```

PostgreSQL
```sql
CREATE TABLE t (
  ...
  text character varying,
  html character varying,
  text_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(text, content_type=>'text/plain')) VIRTUAL HIDDEN,
  html_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(html, content_type=>'text/html')) VIRTUAL HIDDEN,
  PRIMARY KEY (...)
);
```

Language detection refinement with the language_tag argument
Tokenization detects the input language automatically by default. When the input language is known, a language_tag argument can be used to refine this behavior:
GoogleSQL
```sql
AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle, language_tag=>"en-us")) HIDDEN
```

PostgreSQL
```sql
albumtitle_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle, language_tag=>'en-us')) VIRTUAL HIDDEN
```

Most applications leave the language_tag argument unspecified and instead rely on automatic language detection. Segmentation for Asian languages like Chinese, Korean, and Japanese doesn't require setting the tokenization language.
The following examples show cases where the language_tag affects tokenization:
| Tokenization function | Produced tokens |
|---|---|
TOKENIZE_FULLTEXT("A tout pourquoi il y a un parce que") | [a, tout, pourquoi, il, ya, un, parce, que] |
TOKENIZE_FULLTEXT("A tout pourquoi il y a un parce que", \ language_tag=>"fr") | [a, tout, pourquoi, il, y, a, un, parce, que] |
TOKENIZE_FULLTEXT("旅 行") | Two tokens: [旅, 行] |
TOKENIZE_FULLTEXT("旅 行", language_tag=>"zh") | One token: [旅行] |
What's next
- Learn about search indexes.
- Learn about numeric indexes.
- Learn about index partitioning.