The ML.BAG_OF_WORDS function

Use the ML.BAG_OF_WORDS function to compute a representation of tokenized documents as the bag (multiset) of their words, disregarding word order and grammar.

You can use this function with models that support manual feature preprocessing.

Syntax

ML.BAG_OF_WORDS(
  tokenized_document
  [, top_k]
  [, frequency_threshold]
)
OVER()

Arguments

ML.BAG_OF_WORDS takes the following arguments:

  • tokenized_document: An ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis. For more information about tokenization in BigQuery, see TEXT_ANALYZE.
  • top_k: Optional argument. Takes an INT64 value, which represents the size of the dictionary, excluding the unknown term. The top_k terms that appear in the most documents are added to the dictionary until this threshold is met. For example, if this value is 20, the top 20 unique terms that appear in the most documents are added and then no additional terms are added.
  • frequency_threshold: Optional argument. Takes an INT64 value that represents the minimum number of documents a term must appear in to be included in the dictionary. For example, if this value is 3, a term must appear in at least three tokenized documents to be added to the dictionary.

Terms are added to a dictionary of terms if they satisfy the criteria for top_k and frequency_threshold; otherwise, they are considered the unknown term. The unknown term is always the first term in the dictionary and is represented as 0. The rest of the dictionary is ordered alphabetically.
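The selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not BigQuery's implementation: the helper name build_dictionary is hypothetical, NULL tokens (None here) are assumed never to enter the dictionary, and ties at the top_k cutoff are assumed to be broken alphabetically, which is consistent with the examples on this page.

```python
from collections import Counter


def build_dictionary(documents, top_k, frequency_threshold):
    """Map each kept term to a 1-based index; index 0 is reserved for the unknown term."""
    # Document frequency: each term counts once per document it appears in.
    # Assumption: NULL tokens (None) are never dictionary terms.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update({token for token in doc if token is not None})

    # Keep terms that appear in at least frequency_threshold documents.
    candidates = [t for t, df in doc_freq.items() if df >= frequency_threshold]

    # Rank by document frequency (highest first); ties are broken
    # alphabetically (assumption). Then keep at most top_k terms.
    candidates.sort(key=lambda t: (-doc_freq[t], t))
    kept = candidates[:top_k]

    # The kept terms are ordered alphabetically, starting at index 1.
    return {term: index for index, term in enumerate(sorted(kept), start=1)}


# Every term qualifies, so nothing maps to the unknown term.
print(build_dictionary([["a", "b", "b", "c"], ["a", "c"]], 32, 1))
# {'a': 1, 'b': 2, 'c': 3}

# a, c, and d each appear in both documents, but only two terms are kept.
print(build_dictionary(
    [["a", "b", "b", "c", "d", "d", "d"], ["a", "c", "c", "d", "d", "d"]], 2, 2))
# {'a': 1, 'c': 2}
```

Any term that is absent from the returned mapping is treated as the unknown term, index 0.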

Output

ML.BAG_OF_WORDS returns a value for every row in the input. Each value has the following type:

ARRAY<STRUCT<index INT64, value FLOAT64>>

Definitions:

  • index: The index of the term that was added to the dictionary. Unknown terms have an index of 0.
  • value: The number of times the corresponding term appears in the document.
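To make the index and value pairs concrete, the following Python sketch encodes one tokenized document against an already-built dictionary. The function name encode and the dict-based dictionary are hypothetical. Tokens that are missing from the dictionary, including NULL (None here), are assumed to be counted under index 0, and pairs are assumed to be emitted in index order, as the examples on this page suggest.

```python
from collections import Counter


def encode(document, dictionary):
    """Return the document as a list of {index, value} pairs, sorted by index."""
    # Look up each token; anything not in the dictionary falls back to
    # index 0, the unknown term (assumption: this includes None).
    counts = Counter(dictionary.get(token, 0) for token in document)
    # Only indexes that actually occur in the document are emitted.
    return [{"index": index, "value": float(count)}
            for index, count in sorted(counts.items())]


dictionary = {"a": 1, "c": 2}  # hypothetical dictionary: 'b' and 'd' are unknown
print(encode(["a", "b", "b", "b", "c", "c", "c", "c", "d", "d"], dictionary))
# [{'index': 0, 'value': 5.0}, {'index': 1, 'value': 1.0}, {'index': 2, 'value': 4.0}]
```

The sketch returns native integers and floats. In the example output on this page, the same numbers are rendered as quoted strings.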

Quotas

See Cloud AI service functions quotas and limits.

Example

The following example calls the ML.BAG_OF_WORDS function on an input column f, with no unknown terms:

WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c'] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 32, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                        results                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"},{"index":"3","value":"1.0"}] |
|  2 |                             [{"index":"1","value":"1.0"},{"index":"3","value":"1.0"}] |
+----+---------------------------------------------------------------------------------------+

Notice that there is no index 0 in the result, as there are no unknown terms.

The following example calls the ML.BAG_OF_WORDS function on an input column f:

WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', NULL] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 4, 2) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                        results                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"4.0"}] |
|  2 | [{"index":"0","value":"1.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
+----+---------------------------------------------------------------------------------------+

Notice that the values for b and d are not returned, because each appears in only one document and frequency_threshold is set to 2. Their occurrences are counted under the unknown term (index 0), as is the NULL token in the second document.

The following example calls the ML.BAG_OF_WORDS function with a lower value of top_k:

WITH ExampleTable AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', 'c'] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 2, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------+
| id |                                        results                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"2.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
|  2 |                             [{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] |
+----+---------------------------------------------------------------------------------------+

Notice that the value for b is not returned: the query requests only the top two terms, and b appears in only one document. Its two occurrences in the first document are counted under index 0 instead.

The following example contains terms with the same document frequency: a, c, and d each appear in both documents. Because top_k is 2, only two of them are kept, and d is excluded from the results due to the alphabetical order.

WITH ExampleData AS (
  SELECT 1 AS id, ['a', 'b', 'b', 'c', 'd', 'd', 'd'] AS f
  UNION ALL
  SELECT 2 AS id, ['a', 'c', 'c', 'd', 'd', 'd'] AS f
)
SELECT id, ML.BAG_OF_WORDS(f, 2, 2) OVER() AS result
FROM ExampleData
ORDER BY id;

The results look like the following:

+----+---------------------------------------------------------------------------------------+
| id |                                         result                                        |
+----+---------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"5.0"},{"index":"1","value":"1.0"},{"index":"2","value":"1.0"}] |
|  2 | [{"index":"0","value":"3.0"},{"index":"1","value":"1.0"},{"index":"2","value":"2.0"}] |
+----+---------------------------------------------------------------------------------------+

Last updated 2025-12-15 UTC.