The ML.MULTI_HOT_ENCODER function

This document describes the ML.MULTI_HOT_ENCODER function, which lets you encode a string array expression by using a multi-hot encoding scheme.

The encoding vocabulary is sorted alphabetically. NULL values and categories that aren't in the vocabulary are encoded with an index value of 0.

When you use this function in the TRANSFORM clause, the vocabulary calculated during training and the top_k and frequency_threshold values that you specified are automatically reused in prediction.
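As a minimal sketch of the TRANSFORM behavior described above, the following statements train a model with the encoder inside the TRANSFORM clause and then run prediction on raw, unencoded input. The dataset, table, model, and column names (`mydataset`, `articles`, `new_articles`, `tag_model`, `tags`, `is_popular`) are hypothetical, and the sketch assumes that the chosen model type accepts the encoded array as a feature.

```sql
-- Hypothetical names throughout; adjust them to your own project.
CREATE OR REPLACE MODEL `mydataset.tag_model`
  TRANSFORM(
    -- Vocabulary is computed once, from the training data:
    -- at most 100 categories, each occurring at least 2 times.
    ML.MULTI_HOT_ENCODER(tags, 100, 2) OVER() AS tags_encoded,
    is_popular
  )
  OPTIONS(
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['is_popular']
  )
AS
SELECT tags, is_popular
FROM `mydataset.articles`;

-- Prediction takes the raw ARRAY<STRING> column. The TRANSFORM clause
-- stored with the model applies the training-time vocabulary.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.tag_model`,
  (SELECT tags FROM `mydataset.new_articles`)
);
```

Because the encoding lives in the model, the prediction query does not call ML.MULTI_HOT_ENCODER itself.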

You can use this function with models that support manual feature preprocessing.

Syntax

ML.MULTI_HOT_ENCODER(array_expression [, top_k] [, frequency_threshold]) OVER()

Arguments

ML.MULTI_HOT_ENCODER takes the following arguments:

  • array_expression: the ARRAY<STRING> expression to encode.
  • top_k: an INT64 value that specifies the number of categories included in the encoding vocabulary. The function selects the top_k most frequent categories in the data and uses those; categories below this threshold are encoded to 0. This value must be less than 1,000,000 to avoid problems due to high dimensionality. The default value is 32,000.
  • frequency_threshold: an INT64 value that limits the categories included in the encoding vocabulary based on category frequency. The function uses categories whose frequency is greater than or equal to frequency_threshold; categories below this threshold are encoded to 0. The default value is 5.
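The following sketch contrasts a call that relies on the default arguments with a call that sets both optional arguments. The column name `tags` and the sample values are made up for illustration. Going by the argument descriptions above, no category in this tiny dataset reaches the default frequency_threshold of 5, so the first call would be expected to encode every category to 0.

```sql
SELECT
  tags,
  -- Defaults: top_k = 32,000 and frequency_threshold = 5.
  ML.MULTI_HOT_ENCODER(tags) OVER() AS encoded_default,
  -- Keep at most the 2 most frequent categories. With
  -- frequency_threshold = 1, any category that occurs is eligible.
  ML.MULTI_HOT_ENCODER(tags, 2, 1) OVER() AS encoded_top2
FROM (
  SELECT ['red', 'green'] AS tags
  UNION ALL
  SELECT ['red', 'blue']
  UNION ALL
  SELECT ['red', 'green', 'yellow']
);
```

On small samples like this one, lowering frequency_threshold is usually necessary to get a non-empty vocabulary.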

Output

ML.MULTI_HOT_ENCODER returns an array of struct values in the form ARRAY<STRUCT<INT64, FLOAT64>>. The first element in the struct provides the index of the encoded string expression, and the second element provides the value of the encoded string expression.

Example

The following example performs multi-hot encoding on a set of string array expressions. It limits the encoding vocabulary to the three categories that occur the most frequently in the data and that also occur one or more times.

SELECT f[OFFSET(0)] AS f0, ML.MULTI_HOT_ENCODER(f, 3, 1) OVER() AS output
FROM (
  SELECT ['a', 'b', 'b', 'c', NULL] AS f
  UNION ALL
  SELECT ['c', 'c', 'd', 'd', NULL] AS f
)
ORDER BY f[OFFSET(0)];

The output looks similar to the following:

+------+-----------------------------+
|  f0  | output.index | output.value |
+------+--------------+--------------+
|  a   |  1           |  1.0         |
|      |  2           |  1.0         |
|      |  3           |  1.0         |
|      |  0           |  1.0         |
|  c   |  3           |  1.0         |
|      |  0           |  1.0         |
+------+-----------------------------+
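If one row per encoded entry is easier to work with than the nested array, the result can be flattened with UNNEST. This is a sketch rather than part of the reference example: the CTE name `encoded` and the alias `e` are arbitrary, and the struct field names `index` and `value` are inferred from the `output.index` and `output.value` column headings shown above.

```sql
WITH encoded AS (
  SELECT
    f[OFFSET(0)] AS f0,
    ML.MULTI_HOT_ENCODER(f, 3, 1) OVER() AS output
  FROM (
    SELECT ['a', 'b', 'b', 'c', NULL] AS f
    UNION ALL
    SELECT ['c', 'c', 'd', 'd', NULL] AS f
  )
)
-- One output row per (index, value) pair in each encoded array.
SELECT f0, e.index, e.value
FROM encoded, UNNEST(output) AS e
ORDER BY f0, e.index;
```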


Last updated 2025-11-24 UTC.