Categorical data#
This is an introduction to pandas categorical data type, including a short comparison with R’s factor.
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.
In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, …) are not possible.
All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
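To make the internal layout concrete, the following minimal sketch inspects the codes and categories of a small Categorical and rebuilds the values from them by hand. The variable names are illustrative only, and the list comprehension is a teaching device, not a description of how pandas performs the lookup internally.

```python
import numpy as np
import pandas as pd

# Two categories; the last value is missing.
cat = pd.Categorical(["b", "a", "b", np.nan], categories=["a", "b"])

codes = cat.codes            # integer positions into the categories array
categories = cat.categories  # Index holding each distinct value once

# Rebuild the values by hand: a code of -1 marks a missing value.
rebuilt = [categories[code] if code != -1 else np.nan for code in codes]

print(list(codes))
print(list(categories))
print(rebuilt)
```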
The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
See also the API docs on categoricals.
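The memory saving mentioned in the first case can be checked with a rough sketch like the one below. The labels and the repeat count are made up for illustration, and the exact byte counts depend on the pandas version and string storage, so only the comparison between the two numbers is meaningful.

```python
import pandas as pd

# 30,000 strings drawn from only three distinct values.
labels = pd.Series(["low", "medium", "high"] * 10_000)
as_category = labels.astype("category")

# deep=True also counts the memory held by the string data itself.
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
```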
Object creation#
Series creation#
Categorical Series or columns in a DataFrame can be created in several ways:
By specifying dtype="category" when constructing a Series:
In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
By converting an existing Series or column to a category dtype:
In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
In [4]: df["B"] = df["A"].astype("category")
In [5]: df
Out[5]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling in the docs.
In [6]: df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
   value    group
0     65  60 - 69
1     49  40 - 49
2     56  50 - 59
3     43  40 - 49
4     43  40 - 49
5     91  90 - 99
6     32  30 - 39
7     87  80 - 89
8     36  30 - 39
9      8    0 - 9
By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.
In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....:
In [11]: s = pd.Series(raw_cat)
In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']

In [13]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
In [14]: df["B"] = raw_cat
In [15]: df
Out[15]:
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN
Categorical data has a specific category dtype:
In [16]: df.dtypes
Out[16]:
A      object
B    category
dtype: object
DataFrame creation#
Similar to the previous section where a single column was converted to categorical, all columns in a DataFrame can be batch converted to categorical either during or after construction.
This can be done during construction by specifying dtype="category" in the DataFrame constructor:
In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object

Note that the categories present in each column differ; the conversion is done column by column, so only labels present in a given column are categories:
In [19]: df["A"]
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']
Analogously, all columns in an existing DataFrame can be batch converted using DataFrame.astype():
In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
In [22]: df_cat = df.astype("category")
In [23]: df_cat.dtypes
Out[23]:
A    category
B    category
dtype: object

This conversion is likewise done column by column:
In [24]: df_cat["A"]
Out[24]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [25]: df_cat["B"]
Out[25]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']
Controlling behavior#
In the examples above where we passed dtype='category', we used the default behavior:
Categories are inferred from the data.
Categories are unordered.
To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.
In [26]: from pandas.api.types import CategoricalDtype
In [27]: s = pd.Series(["a", "b", "c", "a"])
In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
In [29]: s_cat = s.astype(cat_type)
In [30]: s_cat
Out[30]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']
Similarly, a CategoricalDtype can be used with a DataFrame to ensure that categories are consistent among all columns.
In [31]: from pandas.api.types import CategoricalDtype
In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
In [34]: df_cat = df.astype(cat_type)
In [35]: df_cat["A"]
Out[35]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
Note
To perform table-wise conversion, where all labels in the entire DataFrame are used as categories for each column, the categories parameter can be determined programmatically by categories=pd.unique(df.to_numpy().ravel()).
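A short sketch of the table-wise conversion described in this note, reusing the two-column frame from the examples above. The names all_labels and shared_type are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Collect every label that occurs anywhere in the frame, in order of appearance.
all_labels = pd.unique(df.to_numpy().ravel())

# Apply the same dtype to every column so they share one set of categories.
shared_type = CategoricalDtype(categories=all_labels)
df_cat = df.astype(shared_type)

print(df_cat["A"].cat.categories.tolist())
print(df_cat["B"].cat.categories.tolist())
```

Compared with dtype="category", column A now also carries the category d and column B the category a, even though neither value occurs in that column.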
If you already have codes and categories, you can use the from_codes() constructor to save the factorize step during normal constructor mode:
In [37]: splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
In [38]: s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
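Because the example above draws its codes at random, a deterministic variant may be easier to follow. This sketch uses hand-written codes (an assumption made for illustration) and includes -1, which from_codes() treats as a missing value.

```python
import pandas as pd

# Codes are positions into `categories`; -1 stands for a missing value.
codes = [0, 1, 0, -1]
split = pd.Categorical.from_codes(codes, categories=["train", "test"])

print(split)
```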
Regaining original data#
To get back to the original Series or NumPy array, use Series.astype(original_dtype) or np.asarray(categorical):
In [39]: s = pd.Series(["a", "b", "c", "a"])
In [40]: s
Out[40]:
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")
In [42]: s2
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]:
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
Note
In contrast to R’s factor function, categorical data does not convert input values to strings; categories end up with the same data type as the original values.
Note
In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use categories to change the categories after creation time.
CategoricalDtype#
A categorical’s type is fully described by
categories: a sequence of unique values and no missing values
ordered: a boolean
This information can be stored in a CategoricalDtype. The categories argument is optional, which implies that the actual categories should be inferred from whatever is present in the data when the pandas.Categorical is created. The categories are assumed to be unordered by default.
In [45]: from pandas.api.types import CategoricalDtype
In [46]: CategoricalDtype(["a", "b", "c"])
Out[46]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)

In [47]: CategoricalDtype(["a", "b", "c"], ordered=True)
Out[47]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True, categories_dtype=object)

In [48]: CategoricalDtype()
Out[48]: CategoricalDtype(categories=None, ordered=False, categories_dtype=None)
A CategoricalDtype can be used in any place pandas expects a dtype. For example pandas.read_csv(), pandas.DataFrame.astype(), or in the Series constructor.
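As a minimal sketch of two of these entry points, the same CategoricalDtype instance is passed below to the Series constructor and to DataFrame.astype(). The size labels and variable names are invented for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)

# Passed directly to the Series constructor ...
sizes = pd.Series(["large", "small", "medium"], dtype=size_type)

# ... or to DataFrame.astype() for an existing column.
df = pd.DataFrame({"size": ["medium", "large", "small"]})
df = df.astype({"size": size_type})

print(sizes.min(), sizes.max())
print(df["size"].dtype == size_type)
```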
Note
As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want the default behavior of the categories being unordered, and equal to the set values present in the array. In other words, dtype='category' is equivalent to dtype=CategoricalDtype().
Equality semantics#
Two instances of CategoricalDtype compare equal whenever they have the same categories and order. When comparing two unordered categoricals, the order of the categories is not considered.
In [49]: c1 = CategoricalDtype(["a", "b", "c"], ordered=False)

# Equal, since order is not considered when ordered=False
In [50]: c1 == CategoricalDtype(["b", "c", "a"], ordered=False)
Out[50]: True

# Unequal, since the second CategoricalDtype is ordered
In [51]: c1 == CategoricalDtype(["a", "b", "c"], ordered=True)
Out[51]: False

All instances of CategoricalDtype compare equal to the string 'category'.
In [52]: c1 == "category"
Out[52]: True
Description#
Using describe() on categorical data will produce similar output to a Series or DataFrame of type string.
In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [54]: df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
In [55]: df.describe()
Out[55]:
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2

In [56]: df["cat"].describe()
Out[56]:
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
Working with categories#
Categorical data has a categories and an ordered property, which list their possible values and whether the ordering matters or not. These properties are exposed as s.cat.categories and s.cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.
In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

It’s also possible to pass in the categories in a specific order:
In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

Note
New categorical data are not automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.
Note
The result of unique() is not always the same as Series.cat.categories, because Series.unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only includes values that are actually present.
In [63]: s = pd.Series(list("babc")).astype(CategoricalDtype(list("abcd")))
In [64]: s
Out[64]:
0    b
1    a
2    b
3    c
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

# categories
In [65]: s.cat.categories
Out[65]: Index(['a', 'b', 'c', 'd'], dtype='object')

# uniques
In [66]: s.unique()
Out[66]:
['b', 'a', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']
Renaming categories#
Renaming categories is done by using the rename_categories() method:
In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [68]: s
Out[68]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: new_categories = ["Group %s" % g for g in s.cat.categories]
In [70]: s = s.cat.rename_categories(new_categories)
In [71]: s
Out[71]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

# You can also pass a dict-like object to map the renaming
In [72]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
In [73]: s
Out[73]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']
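In the dict example above none of the keys (1, 2, 3) matches an existing category, so the output is unchanged. The following sketch, with made-up replacement labels, shows a mapping whose keys do match: only the named categories are renamed and the rest are kept.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Only the categories named as keys are renamed; "c" is left untouched.
renamed = s.cat.rename_categories({"a": "x", "b": "y"})

print(renamed.tolist())
print(renamed.cat.categories.tolist())
```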
Note
In contrast to R’s factor, categorical data can have categories of other types than string.
Categories must be unique or a ValueError is raised:
In [74]: try:
   ....:     s = s.cat.rename_categories([1, 1, 1])
   ....: except ValueError as e:
   ....:     print("ValueError:", str(e))
   ....:
ValueError: Categorical categories must be unique

Categories must also not be NaN or a ValueError is raised:
In [75]: try:
   ....:     s = s.cat.rename_categories([1, 2, np.nan])
   ....: except ValueError as e:
   ....:     print("ValueError:", str(e))
   ....:
ValueError: Categorical categories cannot be null
Appending new categories#
Appending categories can be done by using the add_categories() method:
In [76]: s = s.cat.add_categories([4])
In [77]: s.cat.categories
Out[77]: Index(['Group a', 'Group b', 'Group c', 4], dtype='object')

In [78]: s
Out[78]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (4, object): ['Group a', 'Group b', 'Group c', 4]
Removing categories#
Removing categories can be done by using the remove_categories() method. Values which are removed are replaced by np.nan:
In [79]: s = s.cat.remove_categories([4])
In [80]: s
Out[80]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']
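The example above removes a category that no value uses, so no NaN appears. As a small illustrative sketch, removing a category that is in use shows the replacement by np.nan:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# "b" is in use, so the affected value becomes missing.
trimmed = s.cat.remove_categories(["b"])

print(trimmed.isna().tolist())
print(trimmed.cat.categories.tolist())
```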
Removing unused categories#
Removing unused categories can also be done:
In [81]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
In [82]: s
Out[82]:
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [83]: s.cat.remove_unused_categories()
Out[83]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
Setting categories#
If you want to remove and add new categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use set_categories().
In [84]: s = pd.Series(["one", "two", "four", "-"], dtype="category")
In [85]: s
Out[85]:
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [86]: s = s.cat.set_categories(["one", "two", "three", "four"])
In [87]: s
Out[87]:
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

Note
Be aware that Categorical.set_categories() cannot know whether some category is omitted intentionally or because it is misspelled or (under Python3) due to a type difference (e.g., NumPy S1 dtype and Python strings). This can result in surprising behaviour!
Sorting and order#
If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.
In [88]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
In [89]: s = s.sort_values()
In [90]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
In [91]: s = s.sort_values()
In [92]: s
Out[92]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [93]: s.min(), s.max()
Out[93]: ('a', 'c')

You can set categorical data to be ordered by using as_ordered() or unordered by using as_unordered(). These will by default return a new object.
In [94]: s.cat.as_ordered()
Out[94]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [95]: s.cat.as_unordered()
Out[95]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorting will use the order defined by categories, not any lexical order present on the data type. This is even true for strings and numeric data:
In [96]: s = pd.Series([1, 2, 3, 1], dtype="category")
In [97]: s = s.cat.set_categories([2, 3, 1], ordered=True)
In [98]: s
Out[98]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [99]: s = s.sort_values()
In [100]: s
Out[100]:
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [101]: s.min(), s.max()
Out[101]: (2, 1)
Reordering#
Reordering the categories is possible via the Categorical.reorder_categories() and the Categorical.set_categories() methods. For Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. This will necessarily make the sort order the same as the categories order.
In [102]: s = pd.Series([1, 2, 3, 1], dtype="category")
In [103]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)
In [104]: s
Out[104]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [105]: s = s.sort_values()
In [106]: s
Out[106]:
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [107]: s.min(), s.max()
Out[107]: (2, 1)
Note
Note the difference between assigning new categories and reordering the categories: the first renames categories and therefore the individual values in the Series, but if the first position was sorted last, the renamed value will still be sorted last. Reordering means that the way values are sorted is different afterwards, but not that individual values in the Series are changed.
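The distinction drawn in this note can be seen by applying both methods to the same input. This is a minimal sketch with invented variable names: renaming leaves the integer codes alone and changes the labels, while reordering leaves the values alone and changes the codes.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

# Renaming keeps the codes and swaps the labels, so the values change.
renamed = s.cat.rename_categories(["c", "b", "a"])

# Reordering keeps the values and changes the codes (and hence the sort order).
reordered = s.cat.reorder_categories(["c", "b", "a"])

print(renamed.tolist(), renamed.cat.codes.tolist())
print(reordered.tolist(), reordered.cat.codes.tolist())
```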
Note
If the Categorical is not ordered, Series.min() and Series.max() will raise TypeError. Numeric operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute the mean between two values if the length of an array is even) do not work and raise a TypeError.
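A brief sketch of both failure modes on an unordered categorical. The errors dictionary is only a device for collecting the messages, whose exact wording may differ between pandas versions.

```python
import pandas as pd

unordered = pd.Series([1, 2, 3, 1], dtype="category")

errors = {}
try:
    unordered.min()   # no order is defined, so there is no minimum
except TypeError as err:
    errors["min"] = str(err)

try:
    unordered + 1     # arithmetic is not defined for categorical data
except TypeError as err:
    errors["add"] = str(err)

print(sorted(errors))
```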
Multi column sorting#
A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns. The ordering of the categorical is determined by the categories of that column.
In [108]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....:
In [109]: dfs.sort_values(by=["A", "B"])
Out[109]:
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Reordering the categories changes a future sort.
In [110]: dfs["A"] = dfs["A"].cat.reorder_categories(["a", "b", "e"])
In [111]: dfs.sort_values(by=["A", "B"])
Out[111]:
   A  B
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
2  e  1
3  e  2
Comparisons#
Comparing categorical data with other objects is possible in three cases:
Comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data.
All comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.
All comparisons of a categorical data to a scalar.
All other comparisons, especially “non-equality” comparisons of two categoricals with different categories or a categorical with any list-like object, will raise a TypeError.
Note
Any “non-equality” comparisons of categorical data with a Series, np.array, list or categorical data with different categories or ordering will raise a TypeError because custom categories ordering could be interpreted in two ways: one with taking into account the ordering and one without.
In [112]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
In [113]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
In [114]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
In [115]: cat
Out[115]:
0    1
1    2
2    3
dtype: category
Categories (3, int64): [3 < 2 < 1]

In [116]: cat_base
Out[116]:
0    2
1    2
2    2
dtype: category
Categories (3, int64): [3 < 2 < 1]

In [117]: cat_base2
Out[117]:
0    2
1    2
2    2
dtype: category
Categories (1, int64): [2]
Comparing to a categorical with the same categories and ordering or to a scalar works:
In [118]: cat > cat_base
Out[118]:
0     True
1    False
2    False
dtype: bool

In [119]: cat > 2
Out[119]:
0     True
1    False
2    False
dtype: bool

Equality comparisons work with any list-like object of same length and scalars:
In [120]: cat == cat_base
Out[120]:
0    False
1     True
2    False
dtype: bool

In [121]: cat == np.array([1, 2, 3])
Out[121]:
0    True
1    True
2    True
dtype: bool

In [122]: cat == 2
Out[122]:
0    False
1     True
2    False
dtype: bool

This doesn’t work because the categories are not the same:
In [123]: try:
   .....:     cat > cat_base2
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Categoricals can only be compared if 'categories' are the same.

If you want to do a “non-equality” comparison of a categorical series with a list-like object which is not categorical data, you need to be explicit and convert the categorical data back to the original values:
In [124]: base = np.array([1, 2, 3])
In [125]: try:
   .....:     cat > base
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.

In [126]: np.asarray(cat) > base
Out[126]: array([False, False, False])

When you compare two unordered categoricals with the same categories, the order is not considered:
In [127]: c1 = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False)
In [128]: c2 = pd.Categorical(["a", "b"], categories=["b", "a"], ordered=False)
In [129]: c1 == c2
Out[129]: array([ True,  True])
Operations#
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:
In [130]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
In [131]: s.value_counts()
Out[131]:
c    2
a    1
b    1
d    0
Name: count, dtype: int64
DataFrame methods like DataFrame.sum() also show “unused” categories when observed=False.
In [132]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )
   .....:
In [133]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: ).T
   .....:
In [134]: df.groupby(level=1, observed=False).sum()
Out[134]:
       0  1
One    3  9
Two    3  6
Three  0  0

Groupby will also show “unused” categories when observed=False:
In [135]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....:
In [136]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
In [137]: df.groupby("cats", observed=False).mean()
Out[137]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN

In [138]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [139]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....:
In [140]: df2.groupby(["cats", "B"], observed=False).mean()
Out[140]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
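To see what observed=False changes, it helps to run the same aggregation with both settings. The sketch below uses a small made-up frame and passes observed explicitly in both calls, since the default value of this parameter differs between pandas versions.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 3]})

# observed=False: one group per category, including the unused "c".
all_groups = df.groupby("cats", observed=False)["values"].sum()

# observed=True: only categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True)["values"].sum()

print(all_groups.index.tolist(), all_groups.tolist())
print(seen_groups.index.tolist(), seen_groups.tolist())
```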
Pivot tables:
In [141]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [142]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
In [143]: pd.pivot_table(df, values="values", index=["A", "B"], observed=False)
Out[143]:
     values
A B
a c     1.0
  d     2.0
b c     3.0
  d     4.0
Data munging#
The optimized pandas data access methods .loc, .iloc, .at, and .iat, work as normal. The only difference is the return type (for getting) and that only values already in categories can be assigned.
Getting#
If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.
In [144]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [145]: cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
In [146]: values = [1, 2, 2, 2, 3, 4, 5]
In [147]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
In [148]: df.iloc[2:4, :]
Out[148]:
  cats  values
j    b       2
k    b       2

In [149]: df.iloc[2:4, :].dtypes
Out[149]:
cats      category
values       int64
dtype: object

In [150]: df.loc["h":"j", "cats"]
Out[150]:
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [151]: df[df["cats"] == "b"]
Out[151]:
  cats  values
i    b       2
j    b       2
k    b       2
An example where the category type is not preserved is if you take one single row: the resulting Series is of dtype object:
# get the complete "h" row as a Series
In [152]: df.loc["h", :]
Out[152]:
cats      a
values    1
Name: h, dtype: object

Returning a single item from categorical data will also return the value, not a categorical of length “1”.
In [153]: df.iat[0, 0]
Out[153]: 'a'

In [154]: df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])
In [155]: df.at["h", "cats"]  # returns a string
Out[155]: 'x'
Note
This is in contrast to R’s factor function, where factor(c(1,2,3))[1] returns a single value factor.
To get a single value Series of type category, you pass in a list with a single value:
In [156]: df.loc[["h"], "cats"]
Out[156]:
h    x
Name: cats, dtype: category
Categories (3, object): ['x', 'y', 'z']
String and datetime accessors#
The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:
In [157]: str_s = pd.Series(list("aabb"))
In [158]: str_cat = str_s.astype("category")
In [159]: str_cat
Out[159]:
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): ['a', 'b']

In [160]: str_cat.str.contains("a")
Out[160]:
0     True
1     True
2    False
3    False
dtype: bool

In [161]: date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
In [162]: date_cat = date_s.astype("category")
In [163]: date_cat
Out[163]:
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

In [164]: date_cat.dt.day
Out[164]:
0    1
1    2
2    3
3    4
4    5
dtype: int32
Note
The returned Series (or DataFrame) is of the same type as if you used the .str.<method> / .dt.<method> on a Series of that type (and not of type category!).
That means that the values returned from methods and properties on the accessors of a Series, and the values returned from the same methods and properties after this Series has been converted to type category, will be equal:
In [165]: ret_s = str_s.str.contains("a")
In [166]: ret_cat = str_cat.str.contains("a")
In [167]: ret_s.dtype == ret_cat.dtype
Out[167]: True

In [168]: ret_s == ret_cat
Out[168]:
0    True
1    True
2    True
3    True
dtype: bool
Note
The work is done on the categories and then a new Series is constructed. This has some performance implication if you have a Series of type string, where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series). In this case it can be faster to convert the original Series to one of type category and use .str.<method> or .dt.<property> on that.
Setting#
Setting values in a categorical column (or Series) works as long as the value is included in the categories:
In [169]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [170]: cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])
In [171]: values = [1, 1, 1, 1, 1, 1, 1]
In [172]: df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
In [173]: df.iloc[2:4, :] = [["b", 2], ["b", 2]]
In [174]: df
Out[174]:
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1

In [175]: try:
   .....:     df.iloc[2:4, :] = [["c", 3], ["c", 3]]
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot setitem on a Categorical with a new category, set the categories first
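The error message above suggests setting the categories first. One way to do that, sketched here on a small standalone Series rather than the DataFrame from the example, is to call add_categories() before assigning the new value.

```python
import pandas as pd

s = pd.Series(["a", "a", "b"], dtype="category")

# "c" is not a category yet, so register it before assigning.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"

print(s.tolist())
print(s.cat.categories.tolist())
```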
Setting values by assigning categorical data will also check that the categories match:
In [176]: df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
In [177]: df
Out[177]:
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1

In [178]: try:
   .....:     df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....:
TypeError: Cannot set a Categorical with another, without identical categories

Assigning a Categorical to parts of a column of other types will use the values:
In [179]: df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
In [180]: df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
In [181]: df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
In [182]: df
Out[182]:
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a

In [183]: df.dtypes
Out[183]:
a    object
b    object
dtype: object
Merging / concatenation#
By default, combining Series or DataFrames which contain the same categories results in category dtype, otherwise results will depend on the dtype of the underlying categories. Merges that result in non-categorical dtypes will likely have higher memory usage. Use .astype or union_categoricals to ensure category results.
In [184]: from pandas.api.types import union_categoricals

# same categories
In [185]: s1 = pd.Series(["a", "b"], dtype="category")
In [186]: s2 = pd.Series(["a", "b", "a"], dtype="category")
In [187]: pd.concat([s1, s2])
Out[187]:
0    a
1    b
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

# different categories
In [188]: s3 = pd.Series(["b", "c"], dtype="category")
In [189]: pd.concat([s1, s3])
Out[189]:
0    a
1    b
0    b
1    c
dtype: object

# Output dtype is inferred based on categories values
In [190]: int_cats = pd.Series([1, 2], dtype="category")
In [191]: float_cats = pd.Series([3.0, 4.0], dtype="category")
In [192]: pd.concat([int_cats, float_cats])
Out[192]:
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64

In [193]: pd.concat([s1, s3]).astype("category")
Out[193]:
0    a
1    b
0    b
1    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [194]: union_categoricals([s1.array, s3.array])
Out[194]:
['a', 'b', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

The following table summarizes the results of merging Categoricals:
arg1 | arg2 | identical | result |
---|---|---|---|
category | category | True | category |
category (object) | category (object) | False | object (dtype is inferred) |
category (int) | category (float) | False | float (dtype is inferred) |
Unioning#
If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function will combine a list-like of categoricals. The new categories will be the union of the categories being combined.
In [195]: from pandas.api.types import union_categoricals
In [196]: a = pd.Categorical(["b", "c"])
In [197]: b = pd.Categorical(["a", "b"])
In [198]: union_categoricals([a, b])
Out[198]:
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexsorted, use the sort_categories=True argument.
In [199]: union_categoricals([a, b], sort_categories=True)
Out[199]:
['b', 'c', 'a', 'b']
Categories (3, object): ['a', 'b', 'c']

union_categoricals also works with the “easy” case of combining two categoricals of the same categories and order information (e.g. what you could also append for).
In [200]: a = pd.Categorical(["a", "b"], ordered=True)
In [201]: b = pd.Categorical(["a", "b", "a"], ordered=True)
In [202]: union_categoricals([a, b])
Out[202]:
['a', 'b', 'a', 'b', 'a']
Categories (2, object): ['a' < 'b']
The below raises TypeError because the categories are ordered and not identical.
In [203]: a = pd.Categorical(["a", "b"], ordered=True)
In [204]: b = pd.Categorical(["a", "b", "c"], ordered=True)
In [205]: union_categoricals([a, b])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[205], line 1
----> 1 union_categoricals([a, b])

File ~/work/pandas/pandas/pandas/core/dtypes/concat.py:341, in union_categoricals(to_union, sort_categories, ignore_order)
    339 if all(c.ordered for c in to_union):
    340     msg = "to union ordered Categoricals, all categories must be the same"
--> 341     raise TypeError(msg)
    342 raise TypeError("Categorical.ordered must be the same")
    344 if ignore_order:

TypeError: to union ordered Categoricals, all categories must be the same
Ordered categoricals with different categories or orderings can be combined by using the ignore_order=True argument.
In [206]: a = pd.Categorical(["a", "b", "c"], ordered=True)
In [207]: b = pd.Categorical(["c", "b", "a"], ordered=True)
In [208]: union_categoricals([a, b], ignore_order=True)
Out[208]:
['a', 'b', 'c', 'c', 'b', 'a']
Categories (3, object): ['a', 'b', 'c']
union_categoricals() also works with a CategoricalIndex, or Series containing categorical data, but note that the resulting array will always be a plain Categorical:
In [209]: a = pd.Series(["b", "c"], dtype="category")
In [210]: b = pd.Series(["a", "b"], dtype="category")
In [211]: union_categoricals([a, b])
Out[211]:
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

Note
union_categoricals may recode the integer codes for categories when combining categoricals. This is likely what you want, but if you are relying on the exact numbering of the categories, be aware.
In [212]: c1 = pd.Categorical(["b", "c"])
In [213]: c2 = pd.Categorical(["a", "b"])
In [214]: c1
Out[214]:
['b', 'c']
Categories (2, object): ['b', 'c']

# "b" is coded to 0
In [215]: c1.codes
Out[215]: array([0, 1], dtype=int8)

In [216]: c2
Out[216]:
['a', 'b']
Categories (2, object): ['a', 'b']

# "b" is coded to 1
In [217]: c2.codes
Out[217]: array([0, 1], dtype=int8)

In [218]: c = union_categoricals([c1, c2])
In [219]: c
Out[219]:
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']

# "b" is coded to 0 throughout, same as c1, different from c2
In [220]: c.codes
Out[220]: array([0, 1, 2, 0], dtype=int8)
Getting data in/out#
You can write data that contains category dtypes to a HDFStore. See here for an example and caveats.
It is also possible to write data to and read data from Stata format files. See here for an example and caveats.
Writing to a CSV file will convert the data, effectively removing any information about thecategorical (categories and ordering). So if you read back the CSV file you have to convert therelevant columns back tocategory
and assign the right categories and categories ordering.
In [221]: import io

In [222]: s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))

# rename the categories
In [223]: s = s.cat.rename_categories(["very good", "good", "bad"])

# reorder the categories and add missing categories
In [224]: s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In [225]: df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})

In [226]: csv = io.StringIO()

In [227]: df.to_csv(csv)

In [228]: df2 = pd.read_csv(io.StringIO(csv.getvalue()))

In [229]: df2.dtypes
Out[229]: 
Unnamed: 0     int64
cats          object
vals           int64
dtype: object

In [230]: df2["cats"]
Out[230]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: object

# Redo the category
In [231]: df2["cats"] = df2["cats"].astype("category")

In [232]: df2["cats"] = df2["cats"].cat.set_categories(
   .....:     ["very bad", "bad", "medium", "good", "very good"]
   .....: )
   .....: 

In [233]: df2.dtypes
Out[233]: 
Unnamed: 0       int64
cats          category
vals             int64
dtype: object

In [234]: df2["cats"]
Out[234]: 
0    very good
1         good
2         good
3    very good
4    very good
5          bad
Name: cats, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
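As an alternative to converting after the fact, the categories can be supplied while parsing. The following is a minimal sketch that passes a CategoricalDtype through the dtype argument of read_csv. The CSV text and the decision to mark the categories as ordered are assumptions made for the example.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Small CSV payload invented for this example.
csv_text = "cats,vals\nvery good,1\ngood,2\ngood,3\nbad,4\n"

# The category list and ordered=True are choices made for illustration.
rating = CategoricalDtype(
    categories=["very bad", "bad", "medium", "good", "very good"],
    ordered=True,
)

# Passing the dtype per column restores categories and ordering while parsing.
df = pd.read_csv(io.StringIO(csv_text), dtype={"cats": rating})
lowest = df["cats"].min()  # uses the logical order, not the lexical one
```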
The same holds for writing to a SQL database with to_sql.
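The round trip through SQL can be sketched with an in-memory SQLite database. The table name demo and the sample data are hypothetical. The point is only that the column does not come back as category and has to be converted again.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame(
    {"cats": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])}
)

# In-memory SQLite database, used here only for illustration.
conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=False)  # "demo" is a hypothetical table name
df2 = pd.read_sql("SELECT * FROM demo", conn)
conn.close()

# The column is read back as plain strings; restore the dtype explicitly.
restored = df2["cats"].astype(pd.CategoricalDtype(["a", "b", "c"]))
```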
Missing data#
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
Missing values should not be included in the Categorical’s categories, only in the values. Instead, it is understood that NaN is different, and is always a possibility. When working with the Categorical’s codes, missing values will always have a code of -1.
In [235]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# only two categories
In [236]: s
Out[236]: 
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']

In [237]: s.cat.codes
Out[237]: 
0    0
1    1
2   -1
3    0
dtype: int8
Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:
In [238]: s = pd.Series(["a", "b", np.nan], dtype="category")

In [239]: s
Out[239]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): ['a', 'b']

In [240]: pd.isna(s)
Out[240]: 
0    False
1    False
2     True
dtype: bool

In [241]: s.fillna("a")
Out[241]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']
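One caveat is worth a short sketch: because all values must be in categories, filling with a value that is not yet a category is rejected. The label "missing" below is a hypothetical placeholder. The exception type is caught broadly because it may differ between pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# Filling with a value that is not a category is rejected.
try:
    s.fillna("missing")
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Registering the category first makes the fill valid.
filled = s.cat.add_categories(["missing"]).fillna("missing")

# dropna() removes the missing row but leaves the categories untouched.
dropped = s.dropna()
```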
Differences to R’s factor#
The following differences to R’s factor functions can be observed:
R’s levels are named categories.
R’s levels are always of type string, while categories in pandas can be of any dtype.
It’s not possible to specify labels at creation time. Use s.cat.rename_categories(new_labels) afterwards.
In contrast to R’s factor function, using categorical data as the sole input to create a new categorical series will not remove unused categories but create a new categorical series which is equal to the passed in one!
R allows for missing values to be included in its levels (pandas’ categories). pandas does not allow NaN categories, but missing values can still be in the values.
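Two of the points above can be illustrated with a short sketch: re-wrapping categorical data keeps unused categories, and labels are assigned after creation. The sample values and the labels "low" and "high" are invented for the example.

```python
import pandas as pd

# "c" is declared as a category but never used.
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# Re-wrapping categorical data keeps the unused category "c".
s2 = pd.Series(pd.Categorical(s))

# Unused categories have to be dropped explicitly.
trimmed = s.cat.remove_unused_categories()

# Labels cannot be given at creation time; rename afterwards.
labelled = trimmed.cat.rename_categories({"a": "low", "b": "high"})
```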
Gotchas#
Memory usage#
The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.
In [242]: s = pd.Series(["foo", "bar"] * 1000)

# object dtype
In [243]: s.nbytes
Out[243]: 16000

# category dtype
In [244]: s.astype("category").nbytes
Out[244]: 2016
Note
If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation.
In [245]: s = pd.Series(["foo%04d" % i for i in range(2000)])

# object dtype
In [246]: s.nbytes
Out[246]: 16000

# category dtype
In [247]: s.astype("category").nbytes
Out[247]: 20000
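The figures above are consistent with a simple breakdown: the bytes of the integer codes plus the bytes of the categories array. Assuming 8-byte object pointers for the categories, 2016 = 2000 × 1 + 2 × 8 and 20000 = 2000 × 2 + 2000 × 8. The sketch below checks only the part that can be stated with confidence, namely that the codes use a wider integer type once there are more categories.

```python
import pandas as pd

few = pd.Series(["foo", "bar"] * 1000).astype("category")
many = pd.Series(["foo%04d" % i for i in range(2000)]).astype("category")

# The codes use a small signed integer dtype that can index all categories.
few_codes = few.cat.codes.dtype    # 2 categories fit in int8
many_codes = many.cat.codes.dtype  # 2000 categories need int16

# The categories themselves are stored once, regardless of data length.
n_few = len(few.cat.categories)
n_many = len(many.cat.categories)
```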
Categorical is not a numpy array#
Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level NumPy array dtype. This leads to some problems.
NumPy itself doesn’t know about the new dtype:
In [248]: try:
   .....:     np.dtype("category")
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....: 
TypeError: data type 'category' not understood

In [249]: dtype = pd.Categorical(["a"]).dtype

In [250]: try:
   .....:     np.dtype(dtype)
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....: 
TypeError: Cannot interpret 'CategoricalDtype(categories=['a'], ordered=False, categories_dtype=object)' as a data type
Dtype comparisons work:
In [251]: dtype == np.str_
Out[251]: False

In [252]: np.str_ == dtype
Out[252]: False
To check if a Series contains Categorical data, use hasattr(s, 'cat'):
In [253]: hasattr(pd.Series(["a"], dtype="category"), "cat")
Out[253]: True

In [254]: hasattr(pd.Series(["a"]), "cat")
Out[254]: False
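A more explicit check, offered here as an alternative sketch rather than the guide’s recommendation, inspects the dtype directly. It relies on pd.CategoricalDtype and on the dtype comparing equal to the string "category".

```python
import pandas as pd

s_cat = pd.Series(["a"], dtype="category")
s_plain = pd.Series(["a"])

# Check the dtype object itself instead of probing for the .cat accessor.
cat_is_categorical = isinstance(s_cat.dtype, pd.CategoricalDtype)
plain_is_categorical = isinstance(s_plain.dtype, pd.CategoricalDtype)

# A CategoricalDtype also compares equal to the string "category".
equals_string = s_cat.dtype == "category"
```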
Using NumPy functions on a Series of type category should not work as Categoricals are not numeric data (even in the case that .categories is numeric).
In [255]: s = pd.Series(pd.Categorical([1, 2, 3, 4]))

In [256]: try:
   .....:     np.sum(s)
   .....: except TypeError as e:
   .....:     print("TypeError:", str(e))
   .....: 
TypeError: 'Categorical' with dtype category does not support reduction 'sum'
Note
If such a function works, please file a bug at pandas-dev/pandas!
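When a numeric reduction is genuinely intended, one possible approach is to convert the data back to the dtype of its categories first. This is a sketch of that workaround, not something prescribed by the text above.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

# Convert back to the (numeric) dtype of the categories before doing arithmetic.
as_numbers = s.astype(s.cat.categories.dtype)
total = np.sum(as_numbers)
```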
dtype in apply#
pandas currently does not preserve the dtype in apply functions: If you apply along rows you get a Series of object dtype (same as getting a row -> getting one element will return a basic type) and applying along columns will also convert to object. NaN values are unaffected. You can use fillna to handle missing values before applying a function.
In [257]: df = pd.DataFrame(
   .....:     {
   .....:         "a": [1, 2, 3, 4],
   .....:         "b": ["a", "b", "c", "d"],
   .....:         "cats": pd.Categorical([1, 2, 3, 2]),
   .....:     }
   .....: )
   .....: 

In [258]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[258]: 
0    <class 'int'>
1    <class 'int'>
2    <class 'int'>
3    <class 'int'>
dtype: object

In [259]: df.apply(lambda col: col.dtype, axis=0)
Out[259]: 
a          int64
b         object
cats    category
dtype: object
Categorical index#
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements. See the advanced indexing docs for a more detailed explanation.
Setting the index will create a CategoricalIndex:
In [260]: cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])

In [261]: strings = ["a", "b", "c", "d"]

In [262]: values = [4, 2, 3, 1]

In [263]: df = pd.DataFrame({"strings": strings, "values": values}, index=cats)

In [264]: df.index
Out[264]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')

# This now sorts by the categories order
In [265]: df.sort_index()
Out[265]: 
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4
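The duplicate-label case mentioned at the start of this section can be sketched as follows. The labels, the category order c, a, b and the column name A are all made up for the example.

```python
import pandas as pd

# Index with repeated labels; the category order is deliberately not alphabetical.
idx = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))
df = pd.DataFrame({"A": range(6)}, index=idx)

# Label-based selection returns every row carrying that label.
rows_a = df.loc["a"]

# Sorting follows the category order (c, a, b), not the lexical order.
sorted_labels = df.sort_index().index.tolist()
```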
Side effects#
Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:
In [266]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

In [267]: s = pd.Series(cat, name="cat")

In [268]: cat
Out[268]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [269]: s.iloc[0:2] = 10

In [270]: cat
Out[270]: 
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
Use copy=True to prevent this behaviour, or simply don’t reuse Categoricals:
In [271]: cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

In [272]: s = pd.Series(cat, name="cat", copy=True)

In [273]: cat
Out[273]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [274]: s.iloc[0:2] = 10

In [275]: cat
Out[275]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
Note
This also happens in some cases when you supply a NumPy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behavior, while using a string array (e.g. np.array(["a","b","c","a"])) will not.