The ML.TF_IDF function

The term frequency-inverse document frequency (TF-IDF) reflects how important a word is to a document in a collection or corpus. Use the ML.TF_IDF function to compute the TF-IDF of terms in a document, given the precomputed inverse document frequency, for use in machine learning model creation.

This function uses a TF-IDF algorithm to compute the relevance of terms in a set of tokenized documents. TF-IDF multiplies two metrics: how many times a term appears in a document (term frequency), and the inverse document frequency of the term across a collection of documents (inverse document frequency).

  • TF-IDF:

    term frequency * inverse document frequency
  • Term frequency:

    (count of term in document) / (document size)
  • Inverse document frequency:

    log(1 + num_documents / (1 + token_document_count))
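As a minimal sketch, the three formulas above can be written in Python as follows. The function names and the sample corpus are hypothetical, and the use of the natural logarithm is an assumption, because the formula only says log.

```python
import math


def term_frequency(term, document):
    """(count of term in document) / (document size)."""
    return document.count(term) / len(document)


def inverse_document_frequency(term, documents):
    """log(1 + num_documents / (1 + token_document_count))."""
    # Number of documents that contain the term at least once.
    token_document_count = sum(1 for doc in documents if term in doc)
    # Assumption: natural logarithm.
    return math.log(1 + len(documents) / (1 + token_document_count))


def tf_idf(term, document, documents):
    """term frequency * inverse document frequency."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


# Hypothetical corpus of three tokenized documents.
corpus = [
    ["red", "apple", "red"],
    ["green", "apple"],
    ["red", "pear"],
]

# "red" fills 2 of the 3 slots in the first document and occurs in 2 of 3 documents.
score = tf_idf("red", corpus[0], corpus)
print(score)
```

In this sketch a term that occurs in every document still gets a positive score, because the 1 added inside the logarithm keeps the inverse document frequency above zero.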

Terms are added to a dictionary of terms if they satisfy the criteria for top_k and frequency_threshold; otherwise, they are considered the unknown term. The unknown term is always the first term in the dictionary and is represented as 0. The rest of the dictionary is ordered alphabetically.
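The dictionary rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the function's implementation: build_dictionary and term_index are hypothetical helpers, ties between terms with the same document count are broken by sort order (an assumption), and "alphabetically" is taken to mean Python's default string ordering, which places uppercase letters before lowercase ones. The sample documents are the ones from the example on this page, with the NULL tokens left out.

```python
from collections import Counter


def build_dictionary(documents, top_k, frequency_threshold):
    """Return {term: index}; index 0 is reserved for the unknown term."""
    # Count the number of documents each term appears in (not total occurrences).
    document_counts = Counter(term for doc in documents for term in set(doc))
    # Keep terms that appear in at least frequency_threshold documents.
    candidates = [t for t, n in document_counts.items() if n >= frequency_threshold]
    # Rank by document count, highest first.
    # Assumption: ties are broken by the term's sort order.
    candidates.sort(key=lambda t: (-document_counts[t], t))
    candidates = candidates[:top_k]
    # Index 0 is the unknown term; the remaining terms are ordered alphabetically.
    return {term: i for i, term in enumerate(sorted(candidates), start=1)}


def term_index(term, dictionary):
    """Terms that did not make it into the dictionary map to the unknown term, 0."""
    return dictionary.get(term, 0)


documents = [
    ["I", "like", "pie", "pie", "pie"],
    ["yum", "yum", "pie"],
    ["I", "yum", "pie"],
    ["you", "like", "pie"],
]
dictionary = build_dictionary(documents, top_k=3, frequency_threshold=1)
print(dictionary)  # {'I': 1, 'like': 2, 'pie': 3}
print(term_index("yum", dictionary))  # 0
```

With top_k set to 3, "yum" and "you" fall outside the dictionary and map to the unknown term, which is consistent with the indexes shown in the example output on this page.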

You can use this function with models that support manual feature preprocessing. For more information, see the following documents:

Syntax

ML.TF_IDF(
  tokenized_document
  [, top_k]
  [, frequency_threshold]
)
OVER()

Arguments

ML.TF_IDF takes the following arguments:

  • tokenized_document: An ARRAY<STRING> value that represents a document that has been tokenized. A tokenized document is a collection of terms (tokens), which are used for text analysis.
  • top_k: Optional argument. Takes an INT64 value, which represents the size of the dictionary, excluding the unknown term. The top_k terms that appear in the most documents are added to the dictionary until this threshold is met. For example, if this value is 20, the top 20 unique terms that appear in the most documents are added and then no additional terms are added.
  • frequency_threshold: Optional argument. Takes an INT64 value that represents the minimum number of documents a term must appear in to be included in the dictionary. For example, if this value is 3, a term must appear in at least three documents to be added to the dictionary.

Output

For each input row, ML.TF_IDF returns a value of the following type, an array of structs with two fields:

ARRAY<STRUCT<index INT64, value FLOAT64>>

Definitions:

  • index: The index of the term that was added to the dictionary. Unknown terms have an index of 0.

  • value: The TF-IDF computation for the term.
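Judging by the example on this page, the returned array appears to be sparse: a document's array lists only the indexes of terms that occur in that document. As a rough sketch, such a result could be expanded into a fixed-length vector in Python as follows. The to_dense helper, its dictionary_size parameter, and the sample row are hypothetical.

```python
def to_dense(sparse_result, dictionary_size):
    """Expand [{index, value}, ...] into a list of length dictionary_size + 1."""
    # Slot 0 holds the unknown term; indexes missing from the result stay at 0.0.
    dense = [0.0] * (dictionary_size + 1)
    for entry in sparse_result:
        # Cast defensively, because clients may render the fields as strings.
        dense[int(entry["index"])] = float(entry["value"])
    return dense


# Hypothetical row: only the unknown term (0) and dictionary term 3 are present.
row = [{"index": 0, "value": 0.5}, {"index": 3, "value": 0.25}]
vector = to_dense(row, dictionary_size=3)
print(vector)  # [0.5, 0.0, 0.0, 0.25]
```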

Quotas

See Cloud AI service functions quotas and limits.

Example

The following example creates a table ExampleTable and applies the ML.TF_IDF function:

WITH
  ExampleTable AS (
    SELECT 1 AS id, ['I', 'like', 'pie', 'pie', 'pie', NULL] AS f
    UNION ALL
    SELECT 2 AS id, ['yum', 'yum', 'pie', NULL] AS f
    UNION ALL
    SELECT 3 AS id, ['I', 'yum', 'pie', NULL] AS f
    UNION ALL
    SELECT 4 AS id, ['you', 'like', 'pie', NULL] AS f
  )
SELECT id, ML.TF_IDF(f, 3, 1) OVER() AS results
FROM ExampleTable
ORDER BY id;

The output is similar to the following:

+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id |                                                                                     results                                                                                     |
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|  1 | [{"index":"0","value":"0.12679902142647365"},{"index":"1","value":"0.1412163100645339"},{"index":"2","value":"0.1412163100645339"},{"index":"3","value":"0.29389333245105953"}] |
|  2 |                                                                                        [{"index":"0","value":"0.5705955964191315"},{"index":"3","value":"0.14694666622552977"}] |
|  3 |                                             [{"index":"0","value":"0.380397064279421"},{"index":"1","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}] |
|  4 |                                             [{"index":"0","value":"0.380397064279421"},{"index":"2","value":"0.21182446509680086"},{"index":"3","value":"0.14694666622552977"}] |
+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
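The values for the dictionary terms in this output can be checked against the formulas given earlier on this page. The following Python sketch does so under two assumptions that are not stated on the page but are consistent with the numbers shown: the logarithm is the natural logarithm, and the NULL token counts toward the document size. The tf_idf helper is hypothetical, and the sketch doesn't attempt to reproduce the values for the unknown term (index 0).

```python
import math

# The four documents from the example; None stands in for the SQL NULL token.
documents = {
    1: ["I", "like", "pie", "pie", "pie", None],
    2: ["yum", "yum", "pie", None],
    3: ["I", "yum", "pie", None],
    4: ["you", "like", "pie", None],
}


def tf_idf(term, doc_id):
    doc = documents[doc_id]
    # Assumption: the NULL token counts toward the document size.
    tf = doc.count(term) / len(doc)
    # Number of documents that contain the term at least once.
    token_document_count = sum(1 for d in documents.values() if term in d)
    # Assumption: natural logarithm.
    idf = math.log(1 + len(documents) / (1 + token_document_count))
    return tf * idf


pie_in_1 = tf_idf("pie", 1)    # index 3 in row 1 of the output
pie_in_2 = tf_idf("pie", 2)    # index 3 in row 2
i_in_1 = tf_idf("I", 1)        # index 1 in row 1
like_in_4 = tf_idf("like", 4)  # index 2 in row 4

for name, value in [("pie/1", pie_in_1), ("pie/2", pie_in_2),
                    ("I/1", i_in_1), ("like/4", like_in_4)]:
    print(name, value)
```

For example, "pie" appears in all four documents, so its inverse document frequency is log(1 + 4 / 5). In document 2 it fills one of four slots, which gives roughly 0.1469, the value reported for index 3 in that row.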

What's next

  • Learn more about TF-IDF outside of machine learning.


Last updated 2025-12-15 UTC.