Categorical data #

This is an introduction to the pandas categorical data type, including a short comparison with R’s factor.

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, …) are not possible.

All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
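The internal layout described above can be inspected through the .cat accessor. The following is a minimal sketch with illustrative variable names; the manual reconstruction at the end only illustrates the idea and is not how pandas implements it.

```python
import numpy as np
import pandas as pd

s = pd.Series(["b", "a", np.nan, "b"], dtype="category")

# Categories are inferred from the data; for these strings they come out sorted.
categories = s.cat.categories
# Integer positions into `categories`; a missing value is stored as -1.
codes = s.cat.codes

# Rebuild the values by hand from codes and categories (illustration only).
rebuilt = [categories[c] if c >= 0 else np.nan for c in codes]
```

Each distinct value is stored once in categories, and the rows only carry small integers.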

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
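The memory saving mentioned in the first case can be checked with Series.memory_usage(). This is a rough sketch on made-up data (three colour names repeated many times); the exact byte counts depend on the pandas version and platform, so only the comparison is meaningful.

```python
import pandas as pd

# Three distinct strings repeated many times, stored as plain Python objects.
colors = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)
as_category = colors.astype("category")

# deep=True also counts the memory held by the string objects themselves.
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The saving shrinks as the number of distinct values approaches the length of the data.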

Object creation#

Series creation#

Categorical Series or columns in a DataFrame can be created in several ways:

By specifying dtype="category" when constructing a Series:

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

By converting an existing Series or column to a category dtype:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df
Out[5]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling in the docs.

In [6]: df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})

In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

In [8]: df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [9]: df.head(10)
Out[9]:
   value    group
0     65  60 - 69
1     49  40 - 49
2     56  50 - 59
3     43  40 - 49
4     43  40 - 49
5     91  90 - 99
6     32  30 - 39
7     87  80 - 89
8     36  30 - 39
9      8    0 - 9

By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']

In [13]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [14]: df["B"] = raw_cat

In [15]: df
Out[15]:
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN

Categorical data has a specific category dtype:

In [16]: df.dtypes
Out[16]:
A      object
B    category
dtype: object

DataFrame creation#

Similar to the previous section where a single column was converted to categorical, all columns in a DataFrame can be batch converted to categorical either during or after construction.

This can be done during construction by specifying dtype="category" in the DataFrame constructor:

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object

Note that the categories present in each column differ; the conversion is done column by column, so only labels present in a given column are categories:

In [19]: df["A"]
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Analogously, all columns in an existing DataFrame can be batch converted using DataFrame.astype():

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]:
A    category
B    category
dtype: object

This conversion is likewise done column by column:

In [24]: df_cat["A"]
Out[24]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [25]: df_cat["B"]
Out[25]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Controlling behavior#

In the examples above where we passed dtype='category', we used the default behavior:

Categories are inferred from the data.
Categories are unordered.

To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

Similarly, a CategoricalDtype can be used with a DataFrame to ensure that categories are consistent among all columns.

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

Note

To perform table-wise conversion, where all labels in the entire DataFrame are used as categories for each column, the categories parameter can be determined programmatically by categories=pd.unique(df.to_numpy().ravel()).
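The table-wise conversion from this note could be spelled out as follows. This is a small sketch on the same two-column frame used earlier; the names all_labels and shared_type are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Collect every label that occurs anywhere in the frame.
all_labels = pd.unique(df.to_numpy().ravel())
shared_type = CategoricalDtype(categories=all_labels)

# Every column now shares the same set of categories.
df_cat = df.astype(shared_type)
```

Here both columns end up with the four categories a, b, c and d, although column A never contains "d".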

If you already have codes and categories, you can use the from_codes() constructor to skip the factorize step that the normal constructor performs:

In [37]: splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])

In [38]: s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
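Because the example above uses random codes, a deterministic variant may be easier to follow. The sketch below uses hand-written codes and also shows that a code of -1 stands for a missing value.

```python
import pandas as pd

# Codes index into `categories`; -1 marks a missing value.
codes = [0, 1, 0, -1]
cat = pd.Categorical.from_codes(codes, categories=["train", "test"])
s = pd.Series(cat)

print(s)
```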

Regaining original data#

To get back to the original Series or NumPy array, use Series.astype(original_dtype) or np.asarray(categorical):

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]:
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]:
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

Note

In contrast to R’s factor function, categorical data does not convert input values to strings; categories will end up the same data type as the original values.

Note

In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use rename_categories() to change the categories after creation time.

CategoricalDtype#

A categorical’s type is fully described by

categories: a sequence of unique values and no missing values
ordered: a boolean

This information can be stored in a CategoricalDtype. The categories argument is optional, which implies that the actual categories should be inferred from whatever is present in the data when the pandas.Categorical is created. The categories are assumed to be unordered by default.

In [45]: from pandas.api.types import CategoricalDtype

In [46]: CategoricalDtype(["a", "b", "c"])
Out[46]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)

In [47]: CategoricalDtype(["a", "b", "c"], ordered=True)
Out[47]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True, categories_dtype=object)

In [48]: CategoricalDtype()
Out[48]: CategoricalDtype(categories=None, ordered=False, categories_dtype=None)

A CategoricalDtype can be used in any place pandas expects a dtype. For example pandas.read_csv(), pandas.DataFrame.astype(), or in the Series constructor.
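Two of the places mentioned above can be sketched as follows. The size_type scale and the column name are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)

# Passed directly to the Series constructor ...
from_constructor = pd.Series(["large", "small"], dtype=size_type)

# ... or to astype() for a single column of an existing DataFrame.
df = pd.DataFrame({"size": ["medium", "large", "small"]})
converted = df.astype({"size": size_type})
```

Both results carry the same dtype, so their categories and ordering are guaranteed to agree.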

Note

As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want the default behavior of the categories being unordered, and equal to the set of values present in the array. In other words, dtype='category' is equivalent to dtype=CategoricalDtype().

Equality semantics#

Two instances of CategoricalDtype compare equal whenever they have the same categories and order. When comparing two unordered categoricals, the order of the categories is not considered.

In [49]: c1 = CategoricalDtype(["a", "b", "c"], ordered=False)

# Equal, since order is not considered when ordered=False
In [50]: c1 == CategoricalDtype(["b", "c", "a"], ordered=False)
Out[50]: True

# Unequal, since the second CategoricalDtype is ordered
In [51]: c1 == CategoricalDtype(["a", "b", "c"], ordered=True)
Out[51]: False

All instances of CategoricalDtype compare equal to the string 'category'.

In [52]: c1 == "category"
Out[52]: True

Description#

Using describe() on categorical data will produce similar output to a Series or DataFrame of type string.

In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

In [54]: df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})

In [55]: df.describe()
Out[55]:
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

In [56]: df["cat"].describe()
Out[56]:
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

Working with categories#

Categorical data has a categories and an ordered property, which list its possible values and whether the ordering matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

It’s also possible to pass in the categories in a specific order:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

Note

New categorical data are not automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.

Note

The result of unique() is not always the same as Series.cat.categories, because Series.unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present.

In [63]: s = pd.Series(list("babc")).astype(CategoricalDtype(list("abcd")))

In [64]: s
Out[64]:
0    b
1    a
2    b
3    c
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

# categories
In [65]: s.cat.categories
Out[65]: Index(['a', 'b', 'c', 'd'], dtype='object')

# uniques
In [66]: s.unique()
Out[66]:
['b', 'a', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']

Renaming categories#

Renaming categories is done by using the rename_categories() method:

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [68]: s
Out[68]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: new_categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s = s.cat.rename_categories(new_categories)

In [71]: s
Out[71]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

# You can also pass a dict-like object to map the renaming.
# Keys that match no existing category are ignored, so nothing changes here.
In [72]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [73]: s
Out[73]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']
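In the dict example above none of the keys match an existing category, so the output is unchanged. The following sketch shows a dict whose keys do match; the replacement labels are made up for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Only "a" and "c" are mapped; "b" is absent from the dict and stays as-is.
# The key "zzz" matches no category and is ignored.
renamed = s.cat.rename_categories({"a": "first", "c": "last", "zzz": "unused"})

print(renamed)
```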

Note

In contrast to R’s factor, categorical data can have categories of other types than string.

Categories must be unique or a ValueError is raised:

In [74]: try:
   ....:     s = s.cat.rename_categories([1, 1, 1])
   ....: except ValueError as e:
   ....:     print("ValueError:", str(e))
   ....:
ValueError: Categorical categories must be unique

Categories must also not be NaN or a ValueError is raised:

In [75]: try:
   ....:     s = s.cat.rename_categories([1, 2, np.nan])
   ....: except ValueError as e:
   ....:     print("ValueError:", str(e))
   ....:
ValueError: Categorical categories cannot be null

Appending new categories#

Appending categories can be done by using the add_categories() method:

In [76]: s = s.cat.add_categories([4])

In [77]: s.cat.categories
Out[77]: Index(['Group a', 'Group b', 'Group c', 4], dtype='object')

In [78]: s
Out[78]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (4, object): ['Group a', 'Group b', 'Group c', 4]

Removing categories#

Removing categories can be done by using the remove_categories() method. Values which are removed are replaced by np.nan:

In [79]: s = s.cat.remove_categories([4])

In [80]: s
Out[80]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

Removing unused categories#

Removing unused categories can also be done:

In [81]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [82]: s
Out[82]:
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [83]: s.cat.remove_unused_categories()
Out[83]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

Setting categories#

If you want to remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use set_categories().

In [84]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [85]: s
Out[85]:
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [86]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [87]: s
Out[87]:
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

Note

Be aware that Categorical.set_categories() cannot know whether some category is omitted intentionally or because it is misspelled or (under Python 3) due to a type difference (e.g., NumPy S1 dtype and Python strings). This can result in surprising behaviour!
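The misspelling case from this note is easy to reproduce. In the sketch below the typo "thre" is deliberate; no error is raised and the affected value silently becomes NaN.

```python
import pandas as pd

s = pd.Series(["one", "two", "three"], dtype="category")

# "thre" is a typo; set_categories() treats "three" as intentionally dropped.
result = s.cat.set_categories(["one", "two", "thre"])

print(result)
```

Checking result.isna() after such a call is one way to catch the mistake early.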

Sorting and order#

If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.

In [88]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))

In [89]: s = s.sort_values()

In [90]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [91]: s = s.sort_values()

In [92]: s
Out[92]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [93]: s.min(), s.max()
Out[93]: ('a', 'c')

You can set categorical data to be ordered by using as_ordered() or unordered by using as_unordered(). These will by default return a new object.

In [94]: s.cat.as_ordered()
Out[94]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [95]: s.cat.as_unordered()
Out[95]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorting will use the order defined by categories, not any lexical order present on the data type. This is even true for strings and numeric data:

In [96]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [97]: s = s.cat.set_categories([2, 3, 1], ordered=True)

In [98]: s
Out[98]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [99]: s = s.sort_values()

In [100]: s
Out[100]:
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [101]: s.min(), s.max()
Out[101]: (2, 1)
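The same applies to strings, which is the “one”, “two”, “three” case mentioned in the introduction. A short sketch, with number_type as an illustrative name:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["three", "one", "two", "one"])

# Plain strings sort alphabetically.
lexical = s.sort_values().tolist()

# An ordered categorical sorts by the position of each value in `categories`.
number_type = CategoricalDtype(["one", "two", "three"], ordered=True)
logical = s.astype(number_type).sort_values().tolist()

print(lexical)
print(logical)
```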

Reordering#

Reordering the categories is possible via the Categorical.reorder_categories() and the Categorical.set_categories() methods. For Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. This will necessarily make the sort order the same as the categories order.

In [102]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [103]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [104]: s
Out[104]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [105]: s = s.sort_values()

In [106]: s
Out[106]:
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [107]: s.min(), s.max()
Out[107]: (2, 1)

Note

Note the difference between assigning new categories and reordering the categories: the first renames categories and therefore the individual values in the Series, but if the first position was sorted last, the renamed value will still be sorted last. Reordering means that the way values are sorted is different afterwards, but not that individual values in the Series are changed.
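The difference can be seen by applying both operations with the same list of labels. This is a minimal sketch on a three-element Series.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

# Renaming is positional: category "a" becomes "c", "b" stays, "c" becomes "a".
renamed = s.cat.rename_categories(["c", "b", "a"])

# Reordering keeps every value and only changes the order of the categories.
reordered = s.cat.reorder_categories(["c", "b", "a"])

print(renamed.tolist())
print(reordered.tolist())
print(reordered.sort_values().tolist())
```

After renaming the values themselves differ; after reordering the values are untouched and only the result of sorting changes.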

Note

If the Categorical is not ordered, Series.min() and Series.max() will raise TypeError. Numeric operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute the mean between two values if the length of an array is even) do not work and raise a TypeError.
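These failure modes can be checked directly. The helper raises_type_error below is a hypothetical convenience function written for this sketch, not part of pandas.

```python
import pandas as pd

unordered = pd.Series([1, 2, 3], dtype="category")
ordered = unordered.cat.as_ordered()


def raises_type_error(func):
    """Return True if calling func() raises TypeError (helper for this sketch)."""
    try:
        func()
    except TypeError:
        return True
    return False


# min() needs an ordering, so it fails on the unordered Series.
min_on_unordered_fails = raises_type_error(unordered.min)
# Arithmetic fails even when the categories are numeric and ordered.
addition_fails = raises_type_error(lambda: ordered + 1)
# With an ordering in place, min() works.
smallest = ordered.min()
```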

Multi column sorting#

A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns. The ordering of the categorical is determined by the categories of that column.

In [108]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )

In [109]: dfs.sort_values(by=["A", "B"])
Out[109]:
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Reordering the categories changes a future sort.

In [110]: dfs["A"] = dfs["A"].cat.reorder_categories(["a", "b", "e"])

In [111]: dfs.sort_values(by=["A", "B"])
Out[111]:
   A  B
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
2  e  1
3  e  2

Comparisons#

Comparing categorical data with other objects is possible in three cases:

Comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data.
All comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.
All comparisons of a categorical data to a scalar.

All other comparisons, especially “non-equality” comparisons of two categoricals with different categories or a categorical with any list-like object, will raise a TypeError.

Note

Any “non-equality” comparisons of categorical data with a Series, np.array, list or categorical data with different categories or ordering will raise a TypeError because custom categories ordering could be interpreted in two ways: one with taking into account the ordering and one without.

In [112]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [113]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [115]: cat
Out[115]:
0    1
1    2
2    3
dtype: category
Categories (3, int64): [3 < 2 < 1]

In [116]: cat_base
Out[116]:
0    2
1    2
2    2
dtype: category
Categories (3, int64): [3 < 2 < 1]

In [117]: cat_base2
Out[117]:
0    2
1    2
2    2
dtype: category
Categories (1, int64): [2]

Comparing to a categorical with the same categories and ordering or to a scalar works:

In [118]: cat > cat_base
Out[118]:
0     True
1    False
2    False
dtype: bool

In [119]: cat > 2
Out[119]:
0     True
1    False
2    False
dtype: bool

Equality comparisons work with any list-like object of same length and scalars:

In [120]: cat == cat_base
Out[120]:
0    False
1     True
2    False
dtype: bool

In [121]: cat == np.array([1, 2, 3])
Out[121]:
0    True
1    True
2    True
dtype: bool

In [122]: cat == 2
Out[122]:
0    False
1     True
2    False
dtype: bool

This doesn’t work because the categories are not the same:

In [123]: try:
   .....:     cat > cat_base2
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Categoricals can only be compared if 'categories' are the same.

If you want to do a “non-equality” comparison of a categorical series with a list-like object which is not categorical data, you need to be explicit and convert the categorical data back to the original values:

In [124]: base = np.array([1, 2, 3])

In [125]: try:
   .....:     cat > base
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.

In [126]: np.asarray(cat) > base
Out[126]: array([False, False, False])

When you compare two unordered categoricals with the same categories, the order is not considered:

In [127]: c1 = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False)

In [128]: c2 = pd.Categorical(["a", "b"], categories=["b", "a"], ordered=False)

In [129]: c1 == c2
Out[129]: array([ True,  True])

Operations#

Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:

In [130]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [131]: s.value_counts()
Out[131]:
c    2
a    1
b    1
d    0
Name: count, dtype: int64

DataFrame methods like DataFrame.sum() also show “unused” categories when observed=False.

In [132]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )

In [133]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: ).T

In [134]: df.groupby(level=1, observed=False).sum()
Out[134]:
       0  1
One    3  9
Two    3  6
Three  0  0

Groupby will also show “unused” categories when observed=False:

In [135]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )

In [136]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [137]: df.groupby("cats", observed=False).mean()
Out[137]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN

In [138]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [139]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )

In [140]: df2.groupby(["cats", "B"], observed=False).mean()
Out[140]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
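For contrast, the sketch below runs the same kind of groupby with both settings of observed. The data is a shortened, made-up variant of the example above.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 3]})

# observed=False keeps the unused category "c" in the result.
with_unused = df.groupby("cats", observed=False)["values"].sum()
# observed=True only reports categories that actually occur in the data.
only_observed = df.groupby("cats", observed=True)["values"].sum()

print(with_unused)
print(only_observed)
```

Passing observed explicitly avoids depending on the default, which has differed between pandas versions.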

Pivot tables:

In [141]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [142]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [143]: pd.pivot_table(df, values="values", index=["A", "B"], observed=False)
Out[143]:
     values
A B
a c     1.0
  d     2.0
b c     3.0
  d     4.0

Data munging#

The optimized pandas data access methods .loc, .iloc, .at, and .iat, work as normal. The only difference is the return type (for getting) and that only values already in categories can be assigned.

Getting#

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.

In [144]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])

In [145]: cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)

In [146]: values = [1, 2, 2, 2, 3, 4, 5]

In [147]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)

In [148]: df.iloc[2:4, :]
Out[148]:
  cats  values
j    b       2
k    b       2

In [149]: df.iloc[2:4, :].dtypes
Out[149]:
cats      category
values       int64
dtype: object

In [150]: df.loc["h":"j", "cats"]
Out[150]:
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [151]: df[df["cats"] == "b"]
Out[151]:
  cats  values
i    b       2
j    b       2
k    b       2

An example where the category type is not preserved is if you take one single row: the resulting Series is of dtype object:

# get the complete "h" row as a Series
In [152]: df.loc["h", :]
Out[152]:
cats      a
values    1
Name: h, dtype: object

Returning a single item from categorical data will also return the value, not a categorical of length “1”.

In [153]: df.iat[0, 0]
Out[153]: 'a'

In [154]: df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])

In [155]: df.at["h", "cats"]  # returns a string
Out[155]: 'x'

Note

This is in contrast to R’s factor function, where factor(c(1,2,3))[1] returns a single value factor.

To get a single value Series of type category, you pass in a list with a single value:

In [156]: df.loc[["h"], "cats"]
Out[156]:
h    x
Name: cats, dtype: category
Categories (3, object): ['x', 'y', 'z']

String and datetime accessors#

The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:

In [157]: str_s = pd.Series(list("aabb"))

In [158]: str_cat = str_s.astype("category")

In [159]: str_cat
Out[159]:
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): ['a', 'b']

In [160]: str_cat.str.contains("a")
Out[160]:
0     True
1     True
2    False
3    False
dtype: bool

In [161]: date_s = pd.Series(pd.date_range("1/1/2015", periods=5))

In [162]: date_cat = date_s.astype("category")

In [163]: date_cat
Out[163]:
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

In [164]: date_cat.dt.day
Out[164]:
0    1
1    2
2    3
3    4
4    5
dtype: int32

Note

The returned Series (or DataFrame) is of the same type as if you used the .str.<method> / .dt.<method> on a Series of that type (and not of type category!).

That means that the returned values from methods and properties on the accessors of a Series and the returned values from methods and properties on the accessors of this Series transformed to one of type category will be equal:

In [165]: ret_s = str_s.str.contains("a")

In [166]: ret_cat = str_cat.str.contains("a")

In [167]: ret_s.dtype == ret_cat.dtype
Out[167]: True

In [168]: ret_s == ret_cat
Out[168]:
0    True
1    True
2    True
3    True
dtype: bool

Note

The work is done on the categories and then a new Series is constructed. This has some performance implication if you have a Series of type string, where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series). In this case it can be faster to convert the original Series to one of type category and use .str.<method> or .dt.<property> on that.
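The idea in this note can be imitated by hand: apply the string method to the few categories once, then expand the result with the integer codes. The sketch below only illustrates that idea on made-up data and is not the actual pandas implementation.

```python
import pandas as pd

s = pd.Series(["apple", "pear", "apple", "apple", "pear"] * 1_000)
cat = s.astype("category")

# Same values either way; the categorical version works on 2 categories
# instead of 5000 individual strings.
direct = s.str.upper()
via_category = cat.str.upper()

# By hand: transform the categories once, then look them up via the codes.
upper_categories = cat.cat.categories.str.upper()
manual = upper_categories.take(cat.cat.codes.to_numpy())
```

Whether this is faster in practice depends on the data and is worth timing on a realistic sample.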

Setting#

Setting values in a categorical column (or Series) works as long as the value is included in the categories:

In [169]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])

In [170]: cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])

In [171]: values = [1, 1, 1, 1, 1, 1, 1]

In [172]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)

In [173]: df.iloc[2:4, :] = [["b", 2], ["b", 2]]

In [174]: df
Out[174]:
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1

In [175]: try:
   .....:     df.iloc[2:4, :] = [["c", 3], ["c", 3]]
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot setitem on a Categorical with a new category, set the categories first
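As the error message suggests, the category has to exist before a value can be assigned. A minimal sketch of that workflow on a small Series, using add_categories():

```python
import pandas as pd

s = pd.Series(["a", "a", "a"], dtype="category")

try:
    s.iloc[0] = "c"  # "c" is not a category yet
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Register the new category first, then the assignment goes through.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"

print(s)
```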

Setting values by assigning categorical data will also check that the categories match:

In [176]: df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])

In [177]: df
Out[177]:
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1

In [178]: try:
   .....:     df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot set a Categorical with another, without identical categories

Assigning a Categorical to parts of a column of other types will use the values:

In [179]: df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})

In [180]: df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])

In [181]: df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])

In [182]: df
Out[182]:
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a

In [183]: df.dtypes
Out[183]:
a    object
b    object
dtype: object

Merging / concatenation#

By default, combining Series or DataFrames which contain the same categories results in category dtype, otherwise results will depend on the dtype of the underlying categories. Merges that result in non-categorical dtypes will likely have higher memory usage. Use .astype or union_categoricals to ensure category results.

In [184]: from pandas.api.types import union_categoricals

# same categories
In [185]: s1 = pd.Series(["a", "b"], dtype="category")

In [186]: s2 = pd.Series(["a", "b", "a"], dtype="category")

In [187]: pd.concat([s1, s2])
Out[187]:
0    a
1    b
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

# different categories
In [188]: s3 = pd.Series(["b", "c"], dtype="category")

In [189]: pd.concat([s1, s3])
Out[189]:
0    a
1    b
0    b
1    c
dtype: object

# Output dtype is inferred based on categories values
In [190]: int_cats = pd.Series([1, 2], dtype="category")

In [191]: float_cats = pd.Series([3.0, 4.0], dtype="category")

In [192]: pd.concat([int_cats, float_cats])
Out[192]:
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64

In [193]: pd.concat([s1, s3]).astype("category")
Out[193]:
0    a
1    b
0    b
1    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [194]: union_categoricals([s1.array, s3.array])
Out[194]:
['a', 'b', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

The following table summarizes the results of merging Categoricals:

arg1	arg2	identical	result
category	category	True	category
category (object)	category (object)	False	object (dtype is inferred)
category (int)	category (float)	False	float (dtype is inferred)

Unioning#

If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function will combine a list-like of categoricals. The new categories will be the union of the categories being combined.

In [195]: from pandas.api.types import union_categoricals

In [196]: a = pd.Categorical(["b", "c"])

In [197]: b = pd.Categorical(["a", "b"])

In [198]: union_categoricals([a, b])
Out[198]:
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexsorted, use the sort_categories=True argument.

In [199]: union_categoricals([a, b], sort_categories=True)
Out[199]:
['b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']

union_categoricals also works with the “easy” case of combining two categoricals of the same categories and order information (e.g. what you could also append for).

In [200]: a = pd.Categorical(["a", "b"], ordered=True)

In [201]: b = pd.Categorical(["a", "b", "a"], ordered=True)

In [202]: union_categoricals([a, b])
Out[202]:
['a', 'b', 'a', 'b', 'a']
Categories (2, object): ['a' < 'b']

The below raises TypeError because the categories are ordered and not identical.

In [203]: a = pd.Categorical(["a", "b"], ordered=True)

In [204]: b = pd.Categorical(["a", "b", "c"], ordered=True)

In [205]: union_categoricals([a, b])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[205], line 1
----> 1 union_categoricals([a, b])

File ~/work/pandas/pandas/pandas/core/dtypes/concat.py:341, in union_categoricals(to_union, sort_categories, ignore_order)
    339     if all(c.ordered for c in to_union):
    340         msg = "to union ordered Categoricals, all categories must be the same"
--> 341         raise TypeError(msg)
    342     raise TypeError("Categorical.ordered must be the same")
    344 if ignore_order:

TypeError: to union ordered Categoricals, all categories must be the same

Ordered categoricals with different categories or orderings can be combined by using the ignore_order=True argument.

In [206]: a = pd.Categorical(["a", "b", "c"], ordered=True)

In [207]: b = pd.Categorical(["c", "b", "a"], ordered=True)

In [208]: union_categoricals([a, b], ignore_order=True)
Out[208]:
['a', 'b', 'c', 'c', 'b', 'a']
Categories (3, object): ['a', 'b', 'c']

union_categoricals() also works with a CategoricalIndex, or Series containing categorical data, but note that the resulting array will always be a plain Categorical:

In [209]: a = pd.Series(["b", "c"], dtype="category")

In [210]: b = pd.Series(["a", "b"], dtype="category")

In [211]: union_categoricals([a, b])
Out[211]:
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

Note

union_categoricals may recode the integer codes for categorieswhen combining categoricals. This is likely what you want,but if you are relying on the exact numbering of the categories, beaware.

In [212]: c1 = pd.Categorical(["b", "c"])
In [213]: c2 = pd.Categorical(["a", "b"])
In [214]: c1
Out[214]:
['b', 'c']
Categories (2, object): ['b', 'c']
# "b" is coded to 0
In [215]: c1.codes
Out[215]: array([0, 1], dtype=int8)
In [216]: c2
Out[216]:
['a', 'b']
Categories (2, object): ['a', 'b']
# "b" is coded to 1
In [217]: c2.codes
Out[217]: array([0, 1], dtype=int8)
In [218]: c = union_categoricals([c1, c2])
In [219]: c
Out[219]:
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']
# "b" is coded to 0 throughout, same as c1, different from c2
In [220]: c.codes
Out[220]: array([0, 1, 2, 0], dtype=int8)

Getting data in/out#

You can write data that contains category dtypes to a HDFStore. See here for an example and caveats.

It is also possible to write data to and read data from Stata format files. See here for an example and caveats.

Writing to a CSV file will convert the data, effectively removing any information about the categorical (categories and ordering). So if you read back the CSV file, you have to convert the relevant columns back to category and assign the right categories and category ordering.

In [221]: import io
In [222]: s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
# rename the categories
In [223]: s = s.cat.rename_categories(["very good", "good", "bad"])
# reorder the categories and add missing categories
In [224]: s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [225]: df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
In [226]: csv = io.StringIO()
In [227]: df.to_csv(csv)
In [228]: df2 = pd.read_csv(io.StringIO(csv.getvalue()))
In [229]: df2.dtypes
Out[229]:
Unnamed: 0     int64
cats          object
vals           int64
dtype: object
In [230]: df2["cats"]
Out[230]:
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: object
# Redo the category
In [231]: df2["cats"] = df2["cats"].astype("category")
In [232]: df2["cats"] = df2["cats"].cat.set_categories(
   .....:     ["very bad", "bad", "medium", "good", "very good"]
   .....: )
   .....:
In [233]: df2.dtypes
Out[233]:
Unnamed: 0       int64
cats          category
vals             int64
dtype: object
In [234]: df2["cats"]
Out[234]:
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
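As an alternative to converting after the fact, the categories and ordering can be supplied while reading. The sketch below assumes the full category list is known in advance and passes a CategoricalDtype through the dtype argument of read_csv; the name rating and the choice of ordered=True are illustrative assumptions.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# The dtype carries the information that the CSV text cannot store.
rating = CategoricalDtype(
    ["very bad", "bad", "medium", "good", "very good"], ordered=True
)

df = pd.DataFrame(
    {"cats": pd.Series(["very good", "good", "bad"], dtype=rating), "vals": [1, 2, 3]}
)

buf = io.StringIO()
df.to_csv(buf, index=False)

# Re-attach categories and ordering while parsing the file.
df2 = pd.read_csv(io.StringIO(buf.getvalue()), dtype={"cats": rating})
```

Keeping the dtype definition in one place and reusing it for both writing and reading avoids the two copies of the category list drifting apart.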

The same holds for writing to a SQL database with to_sql.

Missing data#

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.

Missing values should not be included in the Categorical’s categories, only in the values. Instead, it is understood that NaN is different, and is always a possibility. When working with the Categorical’s codes, missing values will always have a code of -1.

In [235]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
# only two categories
In [236]: s
Out[236]:
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']
In [237]: s.cat.codes
Out[237]:
0    0
1    1
2   -1
3    0
dtype: int8

Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:

In [238]: s = pd.Series(["a", "b", np.nan], dtype="category")
In [239]: s
Out[239]:
0      a
1      b
2    NaN
dtype: category
Categories (2, object): ['a', 'b']
In [240]: pd.isna(s)
Out[240]:
0    False
1    False
2     True
dtype: bool
In [241]: s.fillna("a")
Out[241]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
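Because all values must be either in categories or np.nan, a fill value that is not yet a category has to be added as a category first. A minimal sketch, assuming a placeholder label "unknown" that is not part of the original example:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# Missing values carry the code -1 and are not part of the categories.
n_missing = int((s.cat.codes == -1).sum())

# Register the placeholder as a category, then fill with it.
filled = s.cat.add_categories(["unknown"]).fillna("unknown")
```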

Differences to R’s `factor`#

The following differences to R’s factor functions can be observed:

R’s levels are named categories.
R’s levels are always of type string, while categories in pandas can be of any dtype.
It’s not possible to specify labels at creation time. Use s.cat.rename_categories(new_labels) afterwards.
In contrast to R’s factor function, using categorical data as the sole input to create a new categorical series will not remove unused categories but create a new categorical series which is equal to the passed in one!
R allows for missing values to be included in its levels (pandas’ categories). pandas does not allow NaN categories, but missing values can still be in the values.
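Two of these points can be sketched in a few lines. The variable names and the labels passed to rename_categories are illustrative assumptions.

```python
import pandas as pd

# "c" is declared as a category but never used as a value.
cat = pd.Categorical(["a", "b"], categories=["a", "b", "c"])

# Re-wrapping the categorical keeps the unused category "c".
rewrapped = pd.Categorical(cat)

# Unused categories have to be dropped explicitly.
trimmed = cat.remove_unused_categories()

# Labels are assigned after creation, not at creation time.
labelled = pd.Series(cat).cat.rename_categories(["first", "second", "third"])
```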

Gotchas#

Memory usage#

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, the memory usage of an object dtype is a constant times the length of the data.

In [242]: s = pd.Series(["foo", "bar"] * 1000)
# object dtype
In [243]: s.nbytes
Out[243]: 16000
# category dtype
In [244]: s.astype("category").nbytes
Out[244]: 2016

Note

If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation.

In [245]: s = pd.Series(["foo%04d" % i for i in range(2000)])
# object dtype
In [246]: s.nbytes
Out[246]: 16000
# category dtype
In [247]: s.astype("category").nbytes
Out[247]: 20000
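For an object column, nbytes counts only the 8-byte references (hence 16000 for 2000 values above), not the Python strings they point to. A rough sketch of a fuller comparison, assuming Series.memory_usage with deep=True is an acceptable measure; the helper deep_bytes is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Few distinct values repeated many times; object dtype is requested explicitly.
repeated = pd.Series(["foo", "bar"] * 1000, dtype=object)


def deep_bytes(series):
    """Bytes used by the values, including the referenced Python objects."""
    return series.memory_usage(index=False, deep=True)


object_bytes = deep_bytes(repeated)
category_bytes = deep_bytes(repeated.astype("category"))
```

With only two distinct values, category_bytes should come out well below object_bytes; the exact numbers depend on the platform and pandas version.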

`Categorical` is not a `numpy` array#

Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level NumPy array dtype. This leads to some problems.

NumPy itself doesn’t know about the new dtype:

In [248]: try:
   .....:     np.dtype("category")
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: data type 'category' not understood
In [249]: dtype = pd.Categorical(["a"]).dtype
In [250]: try:
   .....:     np.dtype(dtype)
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot interpret 'CategoricalDtype(categories=['a'], ordered=False, categories_dtype=object)' as a data type

Dtype comparisons work:

In [251]: dtype == np.str_
Out[251]: False
In [252]: np.str_ == dtype
Out[252]: False

To check if a Series contains Categorical data, use hasattr(s, 'cat'):

In [253]: hasattr(pd.Series(["a"], dtype="category"), "cat")
Out[253]: True
In [254]: hasattr(pd.Series(["a"]), "cat")
Out[254]: False

Using NumPy functions on a Series of type category should not work as Categoricals are not numeric data (even in the case that .categories is numeric).

In [255]: s = pd.Series(pd.Categorical([1, 2, 3, 4]))
In [256]: try:
   .....:     np.sum(s)
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: 'Categorical' with dtype category does not support reduction 'sum'

Note

If such a function works, please file a bug at pandas-dev/pandas!
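When the categories really are numbers and arithmetic is intended, one option is to convert back to a numeric dtype first. A minimal sketch, assuming there are no missing values so that int64 is a valid target:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

# Convert the categorical back to a numeric dtype first ...
as_int = s.astype("int64")

# ... then NumPy reductions operate on ordinary integers.
total = np.sum(as_int)
```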

dtype in apply#

pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (same as getting a row -> getting one element will return a basic type) and applying along columns will also convert to object. NaN values are unaffected. You can use fillna to handle missing values before applying a function.

In [257]: df = pd.DataFrame(
   .....:     {
   .....:         "a": [1, 2, 3, 4],
   .....:         "b": ["a", "b", "c", "d"],
   .....:         "cats": pd.Categorical([1, 2, 3, 2]),
   .....:     }
   .....: )
   .....:
In [258]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[258]:
0    <class 'int'>
1    <class 'int'>
2    <class 'int'>
3    <class 'int'>
dtype: object
In [259]: df.apply(lambda col: col.dtype, axis=0)
Out[259]:
a          int64
b         object
cats    category
dtype: object
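If a row-wise apply returns values that should be categorical again, the dtype can be re-applied to the result. This is a sketch under the assumption that every returned value is one of the original categories; the names picked and restored are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "cats": pd.Categorical([1, 2, 3, 2]),
    }
)

# Row-wise apply hands back basic values, so the category dtype is lost.
picked = df.apply(lambda row: row["cats"], axis=1)

# Re-apply the original dtype to get a categorical Series again.
restored = picked.astype(df["cats"].dtype)
```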

Categorical index#

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements. See the advanced indexing docs for a more detailed explanation.

Setting the index will create a CategoricalIndex:

In [260]: cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
In [261]: strings = ["a", "b", "c", "d"]
In [262]: values = [4, 2, 3, 1]
In [263]: df = pd.DataFrame({"strings": strings, "values": values}, index=cats)
In [264]: df.index
Out[264]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')
# This now sorts by the categories order
In [265]: df.sort_index()
Out[265]:
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4
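The duplicate-label case mentioned above can be sketched as follows. The labels and column name are illustrative assumptions.

```python
import pandas as pd

# An index with a repeated label and an explicit category order.
idx = pd.CategoricalIndex(["a", "b", "a", "c"], categories=["c", "b", "a"])
df = pd.DataFrame({"values": [1, 2, 3, 4]}, index=idx)

# Label-based selection returns every row carrying the label.
rows_a = df.loc["a"]

# Sorting follows the category order (c, b, a), not the lexical order.
sorted_labels = list(df.sort_index().index)
```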

Side effects#

Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:

In [266]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
In [267]: s = pd.Series(cat, name="cat")
In [268]: cat
Out[268]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [269]: s.iloc[0:2] = 10
In [270]: cat
Out[270]:
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

Use copy=True to prevent such a behaviour or simply don’t reuse Categoricals:

In [271]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
In [272]: s = pd.Series(cat, name="cat", copy=True)
In [273]: cat
Out[273]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [274]: s.iloc[0:2] = 10
In [275]: cat
Out[275]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

Note

This also happens in some cases when you supply a NumPy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behavior, while using a string array (e.g. np.array(["a","b","c","a"])) will not.
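Another way to decouple the two objects is to pass an explicit copy of the Categorical. This sketch uses Categorical.copy() and is meant as an equivalent alternative to copy=True, not a different recommendation.

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# Hand the Series its own copy so later writes cannot reach `cat`.
s = pd.Series(cat.copy(), name="cat")
s.iloc[0:2] = 10
```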
