StringIndexer #

class pyspark.ml.feature.StringIndexer(*, inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid='error', stringOrderType='frequencyDesc')

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in [0, numLabels). By default, labels are ordered by frequency, so the most frequent label gets index 0. The ordering is controlled by the stringOrderType parameter, whose default value is 'frequencyDesc'.
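To make the default behaviour concrete, here is a plain-Python emulation of the rule described above. It is a sketch, not Spark's implementation: the helper name `frequency_desc_indices` is hypothetical, and breaking ties alphabetically is an assumption taken from the `stringOrderType` parameter doc.

```python
from collections import Counter


def frequency_desc_indices(values):
    """Map each distinct value (cast to str) to a float index, most frequent first."""
    # Numeric inputs are cast to strings before counting, as the description states.
    counts = Counter(str(v) for v in values)
    # Descending count; ties broken alphabetically (assumption, see lead-in).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


labels = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_desc_indices(labels)
print(mapping)                       # label -> index
print([mapping[l] for l in labels])  # per-row indices, all in [0, numLabels)
print(frequency_desc_indices([10, 10, 3]))  # numeric input is indexed by its string form
```

With these six labels, "a" occurs three times, "c" twice and "b" once, so the emulation assigns them indices 0, 1 and 2 respectively.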

New in version 1.4.0.

Examples

>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed",
...     stringOrderType="frequencyDesc")
>>> stringIndexer.setHandleInvalid("error")
StringIndexer...
>>> model = stringIndexer.fit(stringIndDf)
>>> model.setHandleInvalid("error")
StringIndexerModel...
>>> td = model.transform(stringIndDf)
>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
...     key=lambda x: x[0])
[(0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0)]
>>> inverter = IndexToString(inputCol="indexed", outputCol="label2", labels=model.labels)
>>> itd = inverter.transform(td)
>>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label2).collect()]),
...     key=lambda x: x[0])
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')]
>>> stringIndexerPath = temp_path + "/string-indexer"
>>> stringIndexer.save(stringIndexerPath)
>>> loadedIndexer = StringIndexer.load(stringIndexerPath)
>>> loadedIndexer.getHandleInvalid() == stringIndexer.getHandleInvalid()
True
>>> modelPath = temp_path + "/string-indexer-model"
>>> model.save(modelPath)
>>> loadedModel = StringIndexerModel.load(modelPath)
>>> loadedModel.labels == model.labels
True
>>> indexToStringPath = temp_path + "/index-to-string"
>>> inverter.save(indexToStringPath)
>>> loadedInverter = IndexToString.load(indexToStringPath)
>>> loadedInverter.getLabels() == inverter.getLabels()
True
>>> loadedModel.transform(stringIndDf).take(1) == model.transform(stringIndDf).take(1)
True
>>> stringIndexer.getStringOrderType()
'frequencyDesc'
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid="error",
...     stringOrderType="alphabetDesc")
>>> model = stringIndexer.fit(stringIndDf)
>>> td = model.transform(stringIndDf)
>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
...     key=lambda x: x[0])
[(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]
>>> fromlabelsModel = StringIndexerModel.from_labels(["a", "b", "c"],
...     inputCol="label", outputCol="indexed", handleInvalid="error")
>>> result = fromlabelsModel.transform(stringIndDf)
>>> sorted(set([(i[0], i[1]) for i in result.select(result.id, result.indexed).collect()]),
...     key=lambda x: x[0])
[(0, 0.0), (1, 1.0), (2, 2.0), (3, 0.0), (4, 0.0), (5, 2.0)]
>>> testData = sc.parallelize([Row(id=0, label1="a", label2="e"),
...                            Row(id=1, label1="b", label2="f"),
...                            Row(id=2, label1="c", label2="e"),
...                            Row(id=3, label1="a", label2="f"),
...                            Row(id=4, label1="a", label2="f"),
...                            Row(id=5, label1="c", label2="f")], 3)
>>> multiRowDf = spark.createDataFrame(testData)
>>> inputs = ["label1", "label2"]
>>> outputs = ["index1", "index2"]
>>> stringIndexer = StringIndexer(inputCols=inputs, outputCols=outputs)
>>> model = stringIndexer.fit(multiRowDf)
>>> result = model.transform(multiRowDf)
>>> sorted(set([(i[0], i[1], i[2]) for i in result.select(result.id, result.index1,
...     result.index2).collect()]), key=lambda x: x[0])
[(0, 0.0, 1.0), (1, 2.0, 0.0), (2, 1.0, 1.0), (3, 0.0, 0.0), (4, 0.0, 0.0), (5, 1.0, 0.0)]
>>> fromlabelsModel = StringIndexerModel.from_arrays_of_labels([["a", "b", "c"], ["e", "f"]],
...     inputCols=inputs, outputCols=outputs)
>>> result = fromlabelsModel.transform(multiRowDf)
>>> sorted(set([(i[0], i[1], i[2]) for i in result.select(result.id, result.index1,
...     result.index2).collect()]), key=lambda x: x[0])
[(0, 0.0, 0.0), (1, 1.0, 1.0), (2, 2.0, 0.0), (3, 0.0, 1.0), (4, 0.0, 1.0), (5, 2.0, 1.0)]
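The examples rely on names supplied by the doctest harness and not defined on this page (`stringIndDf`, `temp_path`, `sc`, `spark`). The following is a minimal sketch of equivalent fixtures, assuming an already-running `SparkSession` is passed in. The helper `make_fixtures` and the constant `EXAMPLE_ROWS` are hypothetical names, and the row contents are inferred from the outputs shown above (ids 0 to 5 mapping back to 'a', 'b', 'c', 'a', 'a', 'c').

```python
import tempfile

# Rows inferred from the example output: ids 0..5 with labels a, b, c, a, a, c.
EXAMPLE_ROWS = [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")]


def make_fixtures(spark):
    """Return (stringIndDf, temp_path) built from an existing SparkSession."""
    # createDataFrame accepts a list of tuples plus a list of column names.
    string_ind_df = spark.createDataFrame(EXAMPLE_ROWS, ["id", "label"])
    # Scratch directory used by the save()/load() calls in the examples.
    temp_path = tempfile.mkdtemp()
    return string_ind_df, temp_path
```

With a live session this would be called as `stringIndDf, temp_path = make_fixtures(spark)`. In a typical setup `sc` would be `spark.sparkContext`.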

Methods

`clear`(param)	Clears a param from the param map if it has been explicitly set.
`copy`([extra])	Creates a copy of this instance with the same uid and some extra params.
`explainParam`(param)	Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
`explainParams`()	Returns the documentation of all params with their optional default values and user-supplied values.
`extractParamMap`([extra])	Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
`fit`(dataset[, params])	Fits a model to the input dataset with optional parameters.
`fitMultiple`(dataset, paramMaps)	Fits a model to the input dataset for each param map in paramMaps.
`getHandleInvalid`()	Gets the value of handleInvalid or its default value.
`getInputCol`()	Gets the value of inputCol or its default value.
`getInputCols`()	Gets the value of inputCols or its default value.
`getOrDefault`(param)	Gets the value of a param in the user-supplied param map or its default value.
`getOutputCol`()	Gets the value of outputCol or its default value.
`getOutputCols`()	Gets the value of outputCols or its default value.
`getParam`(paramName)	Gets a param by its name.
`getStringOrderType`()	Gets the value of `stringOrderType` or its default value 'frequencyDesc'.
`hasDefault`(param)	Checks whether a param has a default value.
`hasParam`(paramName)	Tests whether this instance contains a param with a given (string) name.
`isDefined`(param)	Checks whether a param is explicitly set by user or has a default value.
`isSet`(param)	Checks whether a param is explicitly set by user.
`load`(path)	Reads an ML instance from the input path, a shortcut of read().load(path).
`read`()	Returns an MLReader instance for this class.
`save`(path)	Save this ML instance to the given path, a shortcut of 'write().save(path)'.
`set`(param, value)	Sets a parameter in the embedded param map.
`setHandleInvalid`(value)	Sets the value of `handleInvalid`.
`setInputCol`(value)	Sets the value of `inputCol`.
`setInputCols`(value)	Sets the value of `inputCols`.
`setOutputCol`(value)	Sets the value of `outputCol`.
`setOutputCols`(value)	Sets the value of `outputCols`.
`setParams`(self, \*[, inputCol, outputCol, ...])	Sets params for this StringIndexer.
`setStringOrderType`(value)	Sets the value of `stringOrderType`.
`write`()	Returns an MLWriter instance for this ML instance.

Attributes

`handleInvalid`
`inputCol`
`inputCols`
`outputCol`
`outputCols`
`params`	Returns all params ordered by name.
`stringOrderType`

Methods Documentation

clear(param)#: Clears a param from the param map if it has been explicitly set.

copy(extra=None)#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra dict, optional: Extra parameters to copy to the new instance

Returns

JavaParams: Copy of this instance

explainParam(param)#: Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()#: Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra dict, optional: extra param values

Returns

dict: merged param map
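The precedence rule (default param values < user-supplied values < extra) can be illustrated with ordinary dictionaries. This is a sketch of the merge order only: the function name `merge_param_maps` is hypothetical, and string keys stand in for the `Param` objects that key a real param map.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three maps so that later sources win on conflicts."""
    merged = dict(defaults)        # lowest precedence
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest precedence
    return merged


defaults = {"handleInvalid": "error", "stringOrderType": "frequencyDesc"}
user_supplied = {"inputCol": "label", "handleInvalid": "skip"}
extra = {"handleInvalid": "keep"}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Here `handleInvalid` appears in all three sources, so the value from `extra` wins, while `stringOrderType` keeps its default because nothing overrides it.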

fit(dataset, params=None)#

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters

dataset pyspark.sql.DataFrame: input dataset.
params dict or list or tuple, optional: an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

Transformer or a list of Transformer: fitted model(s)

fitMultiple(dataset, paramMaps)#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters

dataset pyspark.sql.DataFrame: input dataset.
paramMaps collections.abc.Sequence: A Sequence of param maps.

Returns

_FitMultipleIterator: A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
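Because index values may arrive out of order, a caller typically places each model by its index rather than appending. The sketch below shows that consumption pattern with a stand-in generator. `fake_fit_multiple` and `collect_models` are hypothetical helpers, and the "models" are plain strings rather than fitted Spark models.

```python
import random


def fake_fit_multiple(param_maps):
    """Stand-in for fitMultiple: yields (index, model) pairs in arbitrary order."""
    order = list(range(len(param_maps)))
    random.shuffle(order)  # simulate non-sequential completion
    for index in order:
        yield index, "model fitted with {}".format(param_maps[index])


def collect_models(param_maps, fitted_iter):
    """Place each model at the position of the param map it was fitted with."""
    models = [None] * len(param_maps)
    for index, model in fitted_iter:
        models[index] = model
    return models


param_maps = [
    {"handleInvalid": "error"},
    {"handleInvalid": "skip"},
    {"handleInvalid": "keep"},
]
models = collect_models(param_maps, fake_fit_multiple(param_maps))
print(models)
```

Whatever order the iterator produces, `models[i]` ends up corresponding to `param_maps[i]`.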

getHandleInvalid()#: Gets the value of handleInvalid or its default value.

getInputCol()#: Gets the value of inputCol or its default value.

getInputCols()#: Gets the value of inputCols or its default value.

getOrDefault(param)#: Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#: Gets the value of outputCol or its default value.

getOutputCols()#: Gets the value of outputCols or its default value.

getParam(paramName)#: Gets a param by its name.

getStringOrderType()#: Gets the value of stringOrderType or its default value 'frequencyDesc'.
New in version 2.3.0.

hasDefault(param)#: Checks whether a param has a default value.

hasParam(paramName)#: Tests whether this instance contains a param with a given (string) name.

isDefined(param)#: Checks whether a param is explicitly set by user or has a default value.

isSet(param)#: Checks whether a param is explicitly set by user.
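The difference between `isSet`, `hasDefault`, `isDefined` and `getOrDefault` is easy to confuse, so here is a toy model of the bookkeeping they describe. `TinyParams` is a hypothetical stand-in keyed by strings, not pyspark code, and raising `KeyError` when neither value exists is a choice made for this sketch.

```python
class TinyParams:
    """Toy stand-in for the default/user-supplied param maps (not pyspark code)."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default param values
        self._user = {}                  # explicitly set by the user

    def set(self, name, value):
        self._user[name] = value

    def isSet(self, name):
        return name in self._user

    def hasDefault(self, name):
        return name in self._defaults

    def isDefined(self, name):
        # Defined means: explicitly set OR has a default.
        return self.isSet(name) or self.hasDefault(name)

    def getOrDefault(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(name)  # neither set nor defaulted


p = TinyParams({"handleInvalid": "error"})
p.set("inputCol", "label")

print(p.isSet("handleInvalid"), p.isDefined("handleInvalid"))  # default only
print(p.isSet("inputCol"), p.hasDefault("inputCol"))           # user-set only
print(p.isDefined("outputCol"))                                # neither
```

A param with only a default is defined but not set, which is why `getOrDefault` succeeds for `handleInvalid` here while `isSet` reports that the user never touched it.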

classmethod load(path)#: Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#: Returns an MLReader instance for this class.

save(path)#: Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)#: Sets a parameter in the embedded param map.

setHandleInvalid(value)#: Sets the value of handleInvalid.

setInputCol(value)#: Sets the value of inputCol.

setInputCols(value)#: Sets the value of inputCols.
New in version 3.0.0.

setOutputCol(value)#: Sets the value of outputCol.

setOutputCols(value)#: Sets the value of outputCols.
New in version 3.0.0.

setParams(self, *, inputCol=None, outputCol=None, inputCols=None, outputCols=None, handleInvalid="error", stringOrderType="frequencyDesc")#: Sets params for this StringIndexer.
New in version 1.4.0.

setStringOrderType(value)#: Sets the value of stringOrderType.
New in version 2.3.0.

write()#: Returns an MLWriter instance for this ML instance.

Attributes Documentation

handleInvalid = Param(parent='undefined', name='handleInvalid', doc="how to handle invalid data (unseen or NULL values) in features and label column of string type. Options are 'skip' (filter out rows with invalid data), error (throw an error), or 'keep' (put invalid data in a special additional bucket, at index numLabels).")
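The three `handleInvalid` options can be emulated on a plain list to show what each does to unseen or NULL values. This is a sketch of the documented semantics, not Spark code. The function name `index_with_handle_invalid` is hypothetical, `None` stands in for NULL, and raising `ValueError` is a choice made for the sketch.

```python
def index_with_handle_invalid(values, labels, handle_invalid="error"):
    """Index values against a fixed label list, emulating the three options."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    indexed = []
    for value in values:
        if value is not None and value in lookup:
            indexed.append(lookup[value])
        elif handle_invalid == "skip":
            continue  # the row with invalid data is filtered out
        elif handle_invalid == "keep":
            indexed.append(float(len(labels)))  # extra bucket at index numLabels
        elif handle_invalid == "error":
            raise ValueError("Unseen or NULL label: {!r}".format(value))
        else:
            raise ValueError("Unknown handleInvalid option: {!r}".format(handle_invalid))
    return indexed


labels = ["a", "c", "b"]            # fitted label order: a -> 0, c -> 1, b -> 2
values = ["a", "d", None, "b"]      # "d" is unseen, None plays the role of NULL

print(index_with_handle_invalid(values, labels, "skip"))
print(index_with_handle_invalid(values, labels, "keep"))
```

With three fitted labels, 'keep' sends both the unseen "d" and the NULL to index 3.0, 'skip' drops those two rows, and 'error' stops at the first invalid value.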

inputCol = Param(parent='undefined', name='inputCol', doc='input column name.')

inputCols = Param(parent='undefined', name='inputCols', doc='input column names.')

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')

outputCols = Param(parent='undefined', name='outputCols', doc='output column names.')

params#: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
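The `dir()`-based discovery mentioned above can be sketched with a toy class. `Param` and `ToyStage` here are simplified, hypothetical stand-ins rather than the pyspark classes. The point is that `dir()` returns attribute names in sorted order, which yields params ordered by name.

```python
class Param:
    """Simplified stand-in for a param descriptor (name plus doc string)."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc


class ToyStage:
    # Declared out of alphabetical order on purpose.
    outputCol = Param("outputCol", "output column name.")
    inputCol = Param("inputCol", "input column name.")
    uid = "ToyStage_0001"  # not a Param, so it is filtered out below

    @property
    def params(self):
        # Skip "params" itself to avoid recursing into this property.
        names = [n for n in dir(self) if n != "params"]
        return [getattr(self, n) for n in names if isinstance(getattr(self, n), Param)]


param_names = [p.name for p in ToyStage().params]
print(param_names)
```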

stringOrderType = Param(parent='undefined', name='stringOrderType', doc='How to order labels of string column. The first label after ordering is assigned an index of 0. Supported options: frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc. Default is frequencyDesc. In case of equal frequency when under frequencyDesc/Asc, the strings are further sorted alphabetically')
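The four supported orderings can be compared side by side with a small plain-Python emulation. This is a sketch, not Spark's implementation: `order_labels` is a hypothetical helper, and reading "further sorted alphabetically" as ascending alphabetical order for ties is an assumption.

```python
from collections import Counter


def order_labels(values, string_order_type="frequencyDesc"):
    """Return distinct labels in the order implied by string_order_type."""
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent first; ties alphabetical (assumed ascending).
        return sorted(counts, key=lambda s: (-counts[s], s))
    if string_order_type == "frequencyAsc":
        # Least frequent first; ties alphabetical (assumed ascending).
        return sorted(counts, key=lambda s: (counts[s], s))
    if string_order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    if string_order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError("Unsupported stringOrderType: {!r}".format(string_order_type))


values = ["a", "b", "c", "a", "a", "c"]
for option in ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"):
    # The position of a label in the returned list is its index.
    print(option, order_labels(values, option))
```

For these values, 'alphabetDesc' orders the labels c, b, a, so "a" receives index 2.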

uid#: A unique id for the object.
