Tokenization

Note: This feature is available with the Spanner Enterprise edition and Enterprise Plus edition. For more information, see the Spanner editions overview.

This page describes how to add tokenization to tables. Tokenization is necessary to create the tokens that are used in the search index.

Tokenization is the process of transforming values into tokens. The method you use to tokenize a document determines the types and efficiency of the searches that users can perform on it.

Spanner provides tokenizers for natural language text, substrings, verbatim text, numbers, and booleans. The database schema uses the tokenizer that matches the type of search needed for the column. Tokenizers have the following characteristics:

  • Each tokenizer is a SQL function that takes an input, such as a string or a number, and named arguments for additional options.
  • The tokenizer outputs a TOKENLIST.

For example, the text string The quick brown fox jumps over the lazy dog is tokenized into [the, quick, brown, fox, jumps, over, the, lazy, dog]. The HTML string The <b>apple</b> is <i>red</i> is tokenized into [the, apple, is, red].

Tokens have the following characteristics:

  • Tokens are stored in columns that use the TOKENLIST data type.
  • Each token is stored as a sequence of bytes, with an optional set of associated attributes. For example, in full-text applications, a token is typically a single word from a textual document.
  • When tokenizing HTML values, Spanner generates attributes that indicate the prominence of a token within the document. Spanner uses these attributes for scoring to boost more prominent terms (such as a heading). A short sketch for inspecting tokens follows this list.
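
The following minimal sketch shows one way to inspect tokenizer output; it assumes a DEBUG_TOKENLIST helper function that renders a TOKENLIST in readable form (verify its availability in the Spanner function reference before relying on it):

GoogleSQL

-- Hedged sketch: print the tokens produced for an HTML input, including
-- any prominence attributes generated during HTML tokenization.
SELECT DEBUG_TOKENLIST(
  TOKENIZE_FULLTEXT("The <b>apple</b> is <i>red</i>", content_type=>"text/html"));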

Tokenizers

Spanner supports the following tokenizer functions:

  • Full-text tokenizer (TOKENIZE_FULLTEXT) produces whole-word tokens for natural language queries.

    Example

    Both of the following functions

    GoogleSQL

    TOKENIZE_FULLTEXT("Yellow apple")TOKENIZE_FULLTEXT("Yellow <b>apple</b>",content_type=>"text/html")

    PostgreSQL

    This example uses spanner.tokenize_fulltext.

    spanner.tokenize_fulltext('Yellow apple')
    spanner.tokenize_fulltext('Yellow <b>apple</b>', content_type=>'text/html')

    produce the same tokens: [yellow, apple].

  • Substring tokenizer (TOKENIZE_SUBSTRING) generates tokens for each n-gram of each word. It's used to find substrings of words in a text.

    Example

    GoogleSQL

    TOKENIZE_SUBSTRING('hello world', ngram_size_min=>4, ngram_size_max=>6)

    PostgreSQL

    This example uses spanner.tokenize_substring.

    spanner.tokenize_substring('hello world', ngram_size_min=>4, ngram_size_max=>6)

    Produces the following tokens: [ello, hell, hello, orld, worl, world].

  • N-gram tokenizer (TOKENIZE_NGRAMS) generates n-grams from an input (without splitting it into separate words). It is used to accelerate regular expression predicates.

    Example

    The following function:

    GoogleSQL

    TOKENIZE_NGRAMS("Big Time",ngram_size_min=>4,ngram_size_max=>4)

    PostgreSQL

    This example uses spanner.tokenize_ngrams.

    spanner.tokenize_ngrams('big time', ngram_size_min=>4, ngram_size_max=>4)

    Produces the following tokens:["Big ","ig T","g Ti"," Tim", "Time"].

  • Exact match tokenizers (TOKEN and TOKENIZE_BOOL) are used to look up rows containing a certain value in one of their columns. For example, an application that indexes a products catalog might want to search for products of a particular brand and color (see the sketch after this list).

    Examples

    The following functions:

    GoogleSQL

    TOKEN("hello")TOKEN(["hello","world"])

    PostgreSQL

    This example uses spanner.token.

    spanner.token('hello')

    Produces the following tokens: [hello].

    The following function:

    GoogleSQL

    TOKENIZE_BOOL(true)

    PostgreSQL

    This example uses spanner.tokenize_bool.

    spanner.tokenize_bool(true)

    Produces the following token: [y].

  • Number tokenizers (TOKENIZE_NUMBER) are used to generate a set of tokens that accelerate numeric comparison searches. For equality conditions, the token is the number itself. For range conditions (like rating >= 3.5), the set of tokens is more elaborate.

    Examples

    The following functions:

    GoogleSQL

    TOKENIZE_NUMBER(42, comparison_type=>'equality')
    TOKENIZE_NUMBER(42, comparison_type=>'all', granularity=>10, min=>1, max=>100)

    PostgreSQL

    This example uses spanner.tokenize_number.

    spanner.tokenize_number(42, comparison_type=>'equality')
    spanner.tokenize_number(42, comparison_type=>'all', granularity=>10, min=>1, max=>100)

    Produce the following tokens, respectively: "==42", and "==42", "[1,75]", "[36,45]", "[36,55]", "[36,75]".

  • JSON and JSONB tokenizers (TOKENIZE_JSON and TOKENIZE_JSONB) are used to generate a set of tokens that accelerate JSON containment and key existence predicates, such as doc[@key] IS NOT NULL (GoogleSQL) or doc ? 'key' (PostgreSQL). A hedged sketch that combines exact match and JSON token columns follows this list.
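
To illustrate the exact match and JSON tokenizers in context, here is a minimal, hedged sketch of generated token columns. The Products table and all of its columns are hypothetical, and the exact TOKENIZE_JSON signature is an assumption; check the function reference for the arguments your edition supports:

GoogleSQL

-- Hypothetical catalog table. Brand_Tokens and InStock_Tokens support
-- exact-match lookups; Attrs_Tokens is assumed to accelerate JSON
-- containment and key-existence predicates.
CREATE TABLE Products (
  ProductId STRING(MAX) NOT NULL,
  Brand STRING(MAX),
  InStock BOOL,
  Attrs JSON,
  Brand_Tokens TOKENLIST AS (TOKEN(Brand)) HIDDEN,
  InStock_Tokens TOKENLIST AS (TOKENIZE_BOOL(InStock)) HIDDEN,
  Attrs_Tokens TOKENLIST AS (TOKENIZE_JSON(Attrs)) HIDDEN
) PRIMARY KEY (ProductId);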

Tokenization functions are usually used in a generated column expression. These columns are defined as HIDDEN so that they aren't included in SELECT * query results.

The following example uses a full-text tokenizer and a numeric tokenizer to create a database that stores music album names and ratings. The DDL statement does two things:

  1. Defines the data columns AlbumTitle and Rating.
  2. Defines AlbumTitle_Tokens and Rating_Tokens. These TOKENLIST columns tokenize the values in the data columns so that Spanner can index them.

    GoogleSQL

    CREATE TABLE Albums (
      AlbumId STRING(MAX) NOT NULL,
      AlbumTitle STRING(MAX),
      Rating FLOAT64,
      AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,
      Rating_Tokens TOKENLIST AS (TOKENIZE_NUMBER(Rating)) HIDDEN
    ) PRIMARY KEY (AlbumId);

    PostgreSQL

    CREATE TABLE albums (
      albumid character varying NOT NULL,
      albumtitle character varying,
      rating double precision,
      albumtitle_tokens spanner.tokenlist
        GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle)) VIRTUAL HIDDEN,
      rating_tokens spanner.tokenlist
        GENERATED ALWAYS AS (spanner.tokenize_number(rating)) VIRTUAL HIDDEN,
      PRIMARY KEY (albumid)
    );

Whenever the base values are modified, AlbumTitle_Tokens and Rating_Tokens are automatically updated.
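
For example, the following minimal sketch (with hypothetical data) writes and then updates a row; the generated TOKENLIST columns are recomputed automatically on each write, with no explicit re-tokenization step:

GoogleSQL

-- Hypothetical rows: Spanner maintains AlbumTitle_Tokens and
-- Rating_Tokens as the base columns change.
INSERT INTO Albums (AlbumId, AlbumTitle, Rating)
VALUES ("a1", "Yellow Apple", 4.5);

UPDATE Albums SET AlbumTitle = "Green Apple" WHERE AlbumId = "a1";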

Tokenize plain text or HTML content

Text tokenization supports plain text and HTML content types. Use the Spanner TOKENIZE_FULLTEXT function to create tokens. Then use the CREATE SEARCH INDEX DDL statement to generate the search index.

For example, the following CREATE TABLE DDL statement uses the TOKENIZE_FULLTEXT function to create tokens from AlbumTitle in the Albums table. The CREATE SEARCH INDEX DDL statement creates a search index with the new AlbumTitle_Tokens.

GoogleSQL

CREATE TABLE Albums (
  AlbumId STRING(MAX) NOT NULL,
  AlbumTitle STRING(MAX),
  AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN
) PRIMARY KEY (AlbumId);

CREATE SEARCH INDEX AlbumsIndex ON Albums(AlbumTitle_Tokens);

PostgreSQL

CREATE TABLE albums (
  albumid character varying NOT NULL,
  albumtitle character varying,
  albumtitle_tokens spanner.tokenlist
    GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle)) VIRTUAL HIDDEN,
  PRIMARY KEY (albumid)
);

CREATE SEARCH INDEX albumsindex ON albums(albumtitle_tokens);
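
Once the search index exists, queries can match rows through the SEARCH function against the tokenized column. The following is a minimal, hedged sketch; the query string is hypothetical:

GoogleSQL

-- Returns albums whose title contains the words "yellow" and "apple".
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'yellow apple');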

The tokenization process uses the following rules:

  • Tokenization doesn't include stemming or correction of misspelled words. For example, in a sentence like "A cat was looking at a group of cats", the token "cat" is indexed separately from the token "cats". Compared to other search engines that normalize tokens during writes, Spanner provides an option to expand the search query to include different forms of words (see the sketch after this list). For more information, see Enhanced query mode.
  • Stopwords (like "a") are included in the search index.
  • Full-text search is always case insensitive. The tokenization process converts all tokens to lowercase.
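
To make the stemming point concrete, the following hedged sketch assumes the SEARCH function accepts an enhance_query argument, as described in Enhanced query mode:

GoogleSQL

-- Hedged sketch: with query enhancement, a search for "cat" can also
-- match rows that only contain "cats", even though the index stores
-- the two tokens separately.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'cat', enhance_query=>TRUE);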

The tokenization process tracks the positions for each token in the original text. These positions are later used to match phrases. The positions are stored in the search index alongside docids.
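
As an illustration of position-based matching, the following sketch assumes that the search query syntax supports quoted phrases; confirm the exact syntax in the SEARCH function reference:

GoogleSQL

-- Hedged sketch: matches titles where "yellow" is immediately followed
-- by "apple", which relies on the stored token positions.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, '"yellow apple"');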

Google continues to improve tokenization algorithms. In some cases, this might lead to a string being tokenized differently in the future from the way it is tokenized now. We expect such cases to be extremely rare. An example is an improvement in Chinese, Japanese, and Korean (CJK) language segmentation.

The content_type argument specifies whether the content format uses plain text or HTML. Use the following settings to set the content_type:

  • For text tokenization, set the content_type argument to "text/plain". This is the default setting.
  • For HTML tokenization, set the content_type argument to "text/html". Without this argument, HTML tags are treated as punctuation. In HTML mode, Spanner uses heuristics to infer how prominent the text is on the page, for example, whether the text is in a heading or what its font size is. The supported attributes for HTML include small, medium, large, title, and link. Like position, the attribute is stored alongside the token in the search index. Tokenization doesn't create tokens for any HTML tags.

Token attributes don't impact matching or the results of the SEARCH or SEARCH_SUBSTRING function. They're only used for ranking.
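
For ranking, the following hedged sketch uses the SCORE function, which is assumed here to take the token attributes into account when computing relevance:

GoogleSQL

-- Hedged sketch: order matching rows by relevance score. Token
-- attributes (for example, prominence from HTML tokenization) can
-- boost a document's score.
SELECT AlbumId
FROM Albums
WHERE SEARCH(AlbumTitle_Tokens, 'yellow apple')
ORDER BY SCORE(AlbumTitle_Tokens, 'yellow apple') DESC;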

The following example shows how to tokenize text:

GoogleSQL

CREATE TABLE T (
  ...
  Text STRING(MAX),
  Html STRING(MAX),
  Text_Tokens TOKENLIST
    AS (TOKENIZE_FULLTEXT(Text, content_type=>"text/plain")) HIDDEN,
  Html_Tokens TOKENLIST
    AS (TOKENIZE_FULLTEXT(Html, content_type=>"text/html")) HIDDEN
) PRIMARY KEY (...);

PostgreSQL

CREATE TABLE t (
  ...
  text character varying,
  html character varying,
  text_tokens spanner.tokenlist
    GENERATED ALWAYS AS (spanner.tokenize_fulltext(text, content_type=>'text/plain')) VIRTUAL HIDDEN,
  html_tokens spanner.tokenlist
    GENERATED ALWAYS AS (spanner.tokenize_fulltext(html, content_type=>'text/html')) VIRTUAL HIDDEN,
  PRIMARY KEY (...)
);

Language detection refinement with the language_tag argument

By default, tokenization detects the input language automatically. When the input language is known, you can use a language_tag argument to refine this behavior:

GoogleSQL

AlbumTitle_Tokens TOKENLIST
  AS (TOKENIZE_FULLTEXT(AlbumTitle, language_tag=>"en-us")) HIDDEN

PostgreSQL

albumtitle_tokens spanner.tokenlist
  GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle, language_tag=>'en-us')) VIRTUAL HIDDEN

Most applications leave the language_tag argument unspecified and instead rely on automatic language detection. Segmentation for Asian languages like Chinese, Korean, and Japanese doesn't require setting the tokenization language.

The following examples show cases where the language_tag affects tokenization:

| Tokenization function | Produced tokens |
| --- | --- |
| TOKENIZE_FULLTEXT("A tout pourquoi il y a un parce que") | [a, tout, pourquoi, il, ya, un, parce, que] |
| TOKENIZE_FULLTEXT("A tout pourquoi il y a un parce que", language_tag=>"fr") | [a, tout, pourquoi, il, y, a, un, parce, que] |
| TOKENIZE_FULLTEXT("旅 行") | Two tokens: [旅, 行] |
| TOKENIZE_FULLTEXT("旅 行", language_tag=>"zh") | One token: [旅行] |

