bigframes.ml.preprocessing.OneHotEncoder
- class bigframes.ml.preprocessing.OneHotEncoder(drop: Literal['most_frequent'] | None = None, min_frequency: int | None = None, max_categories: int | None = None)
Encode categorical features as a one-hot format.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme.
Note that this class deviates from scikit-learn: instead of producing sparse binary columns, the encoding is a single column of STRUCT<index INT64, value DOUBLE>.
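To make the STRUCT<index INT64, value DOUBLE> representation more concrete, here is a minimal pure-Python sketch that expands one encoded cell into a dense one-hot vector. The helper name `to_dense` and the `size` argument are hypothetical and not part of bigframes; the cell shape follows the transform output shown in the example on this page.

```python
from typing import Dict, List


def to_dense(encoded: List[Dict[str, float]], size: int) -> List[float]:
    """Expand a list of {'index', 'value'} structs into a dense vector.

    `size` is the length of the dense vector. It is an assumption of this
    sketch that the caller knows it (for example, number of categories + 1
    to leave room for index 0).
    """
    dense = [0.0] * size
    for entry in encoded:
        dense[int(entry["index"])] = float(entry["value"])
    return dense


# One encoded cell, in the same shape as the transform output on this page.
cell = [{"index": 2, "value": 1.0}]
dense = to_dense(cell, size=3)
print(dense)
```

This runs locally on values already fetched from the encoded column; it does not call BigQuery.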
Examples:
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.
>>> from bigframes.ml.preprocessing import OneHotEncoder
>>> import bigframes.pandas as bpd

>>> enc = OneHotEncoder()
>>> X = bpd.DataFrame({"a": ["Male", "Female", "Female"], "b": ["1", "3", "2"]})
>>> enc.fit(X)
OneHotEncoder()

>>> print(enc.transform(bpd.DataFrame({"a": ["Female", "Male"], "b": ["1", "4"]})))
                onehotencoded_a               onehotencoded_b
0  [{'index': 1, 'value': 1.0}]  [{'index': 1, 'value': 1.0}]
1  [{'index': 2, 'value': 1.0}]  [{'index': 0, 'value': 1.0}]

[2 rows x 2 columns]
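The indexes in the output above are consistent with numbering the fitted categories from 1 in sorted order and sending unseen values (such as "4") to index 0. The following pure-Python sketch reproduces that mapping. It is an inference from this one example, not a description of how bigframes computes the encoding, and the helper names `fit_indexes` and `encode` are hypothetical.

```python
from typing import Dict, List


def fit_indexes(values: List[str]) -> Dict[str, int]:
    # Assumption: categories are numbered from 1 in sorted order,
    # and index 0 is reserved for values not seen during fitting.
    return {cat: i for i, cat in enumerate(sorted(set(values)), start=1)}


def encode(value: str, indexes: Dict[str, int]) -> List[dict]:
    # Unknown values fall back to index 0.
    return [{"index": indexes.get(value, 0), "value": 1.0}]


a_indexes = fit_indexes(["Male", "Female", "Female"])
b_indexes = fit_indexes(["1", "3", "2"])

print(encode("Female", a_indexes))
print(encode("Male", a_indexes))
print(encode("4", b_indexes))
```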
- Parameters:
drop (Optional[Literal["most_frequent"]], default None) – Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models. Default None: retain all the categories. "most_frequent": drop the most frequent category found in the string expression. Selecting this value causes the function to use dummy encoding.
min_frequency (Optional[int], default None) – Specifies the minimum frequency below which a category will be considered infrequent. Default None. int: categories that occur fewer times than this value are considered infrequent and are encoded as index 0.
max_categories (Optional[int], default None) – Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. Default None, which sets the limit to 1,000,000.
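As a rough illustration of how min_frequency and max_categories interact, the sketch below decides which categories stay "frequent" and which would collapse into the infrequent bucket at index 0. It is one reading of the parameter descriptions above, written in plain Python. The function name `frequent_categories` and the tie-breaking rule (higher count first, then alphabetical) are assumptions and may differ from what BigQuery actually does.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def frequent_categories(
    values: Iterable[str],
    min_frequency: Optional[int] = None,
    max_categories: Optional[int] = None,
) -> Set[str]:
    """Return the categories that keep their own index (hypothetical helper)."""
    counts = Counter(values)
    # Categories occurring fewer than min_frequency times become infrequent.
    kept = [
        (cat, n)
        for cat, n in counts.items()
        if min_frequency is None or n >= min_frequency
    ]
    # Assumed tie-breaking: most frequent first, then alphabetical.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if max_categories is not None:
        has_infrequent = len(kept) < len(counts)
        if has_infrequent or len(kept) > max_categories:
            # One output slot is taken by the infrequent bucket.
            kept = kept[: max(max_categories - 1, 0)]
    return {cat for cat, _ in kept}


values = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
print(frequent_categories(values, min_frequency=2))
print(frequent_categories(values, max_categories=2))
```

With min_frequency=2, "c" and "d" fall into the infrequent bucket. With max_categories=2, only "a" keeps its own index, because the second slot is used by the bucket.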
Attributes
Methods
__init__([drop, min_frequency, max_categories])
fit(X[, y]) – Fit OneHotEncoder to X.
fit_transform(X[, y])
get_params([deep]) – Get parameters for this estimator.
to_gbq(model_name[, replace]) – Save the transformer as a BigQuery model.
transform(X) – Transform X using one-hot encoding.