pandas.Index.factorize
- Index.factorize(sort=False, use_na_sentinel=True)
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function, pandas.factorize(), and as a method, Series.factorize() and Index.factorize().
- Parameters:
- sort : bool, default False
Sort uniques and shuffle codes to maintain the relationship.
- use_na_sentinel : bool, default True
If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers, and NaN will not be dropped from the uniques of the values.
Added in version 1.5.0.
- Returns:
- codes : ndarray
An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.
- uniques : ndarray, Index, or Categorical
The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
Note
With the default use_na_sentinel=True, uniques will not contain an entry for a missing value, even if there’s one in values.
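The relationship between codes and uniques described above can be sketched as a round trip. This is a minimal illustration that calls the method on an Index; the sample data and the names idx and rebuilt are assumptions made for the example, not part of the pandas documentation.

```python
import pandas as pd

# Hypothetical sample data; any Index without missing values works here.
idx = pd.Index(["b", "b", "a", "c", "b"])
codes, uniques = idx.factorize()

# codes holds integer positions into uniques, so take() maps them back
# to the original values.
rebuilt = uniques.take(codes)
print(rebuilt.equals(idx))
```

The round trip should only be relied on when no code is -1, because a plain take() treats -1 as an ordinary negative position and selects the last unique value.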
Notes
Refer to the user guide for more examples.
Examples
These examples all show factorize as a top-level function, pd.factorize(values). The results are identical for methods like Series.factorize().
>>> codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype="O"))
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.
>>> codes, uniques = pd.factorize(np.array(['b', 'b', 'a', 'c', 'b'], dtype="O"),
...                               sort=True)
>>> codes
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1, and missing values are not included in uniques.
>>> codes, uniques = pd.factorize(np.array(['b', None, 'a', 'c', 'b'], dtype="O"))
>>> codes
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
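Because the sentinel is an ordinary integer in codes, downstream code usually has to mask it out before using codes as positions. The following sketch, under the assumption of a small hypothetical Index, counts occurrences per unique value with NumPy while handling the missing entries separately; the names missing and counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing one missing value.
idx = pd.Index(["b", None, "a", "c", "b"])
codes, uniques = idx.factorize()  # default: use_na_sentinel=True

# Entries equal to -1 mark missing values; exclude them before counting,
# since np.bincount does not accept negative integers.
missing = codes == -1
counts = np.bincount(codes[~missing], minlength=len(uniques))

for value, count in zip(uniques, counts):
    print(value, count)
print("missing:", int(missing.sum()))
```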
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting use_na_sentinel=False.
>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values)  # default: use_na_sentinel=True
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array([1., 2.])
>>> codes, uniques = pd.factorize(values, use_na_sentinel=False)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1.,  2., nan])
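Since the examples above use the top-level function, the same comparison can be sketched with the Index method itself. This is an illustrative snippet only: the sample values and variable names are assumptions, and use_na_sentinel requires pandas 1.5.0 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical float Index with one missing value.
idx = pd.Index([1.0, 2.0, 1.0, np.nan])

# Default behaviour: NaN is encoded as -1 and left out of uniques.
codes, uniques = idx.factorize()

# With use_na_sentinel=False, NaN gets its own non-negative code
# and is kept in uniques.
codes_na, uniques_na = idx.factorize(use_na_sentinel=False)

print(codes, uniques)
print(codes_na, uniques_na)
```

In both calls uniques comes back as an Index rather than a NumPy array, matching the return type described for pandas objects.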