
Python Pandas - Categorical Data
In Pandas, categorical data is a specialized data type for categorical variables, similar to factors in R and commonly used in statistics. A categorical variable takes one of a limited set of values, such as "male" or "female," or ratings on a scale such as "poor," "average," and "excellent." Unlike numerical data, categorical data does not support mathematical operations like addition or division.
Pandas stores categorical data as an array of the distinct category values plus an array of integer codes that refer to those categories. For large datasets with many repeated values, this saves memory and can improve performance.
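To make the storage layout concrete, the following minimal sketch inspects the categories and integer codes through the `.cat` accessor and compares memory usage with a plain string Series. The sample data is invented for illustration, and the exact byte counts depend on the Pandas version, so none are shown here.

```python
import pandas as pd

# Illustrative data: three distinct strings repeated many times
s_obj = pd.Series(["low", "medium", "high"] * 1000)
s_cat = s_obj.astype("category")

# The distinct values are stored once as the categories
print(s_cat.cat.categories.tolist())

# Each row holds a small integer code that points at one category
print(s_cat.cat.codes.head().tolist())

# Compare memory footprints (deep=True also counts the string contents)
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print("plain strings:", obj_bytes, "bytes")
print("categorical:  ", cat_bytes, "bytes")
```

The saving grows with the number of repeats, because each additional row costs only one integer code rather than another copy of the string.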
The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (for example, "one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
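The difference between lexical and logical order can be sketched as follows. The values "one", "two", "three" come from the case described above; the variable names are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series(["two", "three", "one", "two"])

# Sorting plain strings uses lexical (alphabetical) order
lexical = s.sort_values().tolist()

# Declare the logical order explicitly with an ordered categorical dtype
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
s_cat = s.astype(order)
logical = s_cat.sort_values().tolist()

print(lexical)                   # ['one', 'three', 'two', 'two']
print(logical)                   # ['one', 'two', 'two', 'three']
print(s_cat.min(), s_cat.max())  # one three
```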
In this tutorial, we will learn the basics of working with categorical data in Pandas, including Series and DataFrame creation, controlling behavior, and recovering the original data from categorical values.
Series and DataFrame Creation with Categorical Data
A Pandas Series or DataFrame object can be created directly with categorical data by passing dtype="category" to the Series() or DataFrame() constructor.
Example: Series Creation with Categorical Data
The following is a basic example of creating a Pandas Series object with categorical data.

    import pandas as pd

    # Create a Series object with categorical data
    s = pd.Series(["a", "b", "c", "a"], dtype="category")

    # Display the categorical Series
    print('Series with Categorical Data:\n', s)

Following is the output of the above code:

    Series with Categorical Data:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
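The same dtype="category" argument also works with the DataFrame() constructor. The sketch below uses made-up column names "A" and "B"; note that each column infers its own set of categories from its own values.

```python
import pandas as pd

# Every column of the new DataFrame is created as categorical
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

print(df.dtypes)

# Categories are inferred separately for each column
print(df["A"].cat.categories.tolist())  # ['a', 'b', 'c']
print(df["B"].cat.categories.tolist())  # ['b', 'c', 'd']
```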
Example: Converting an Existing DataFrame Column to Categorical
This example demonstrates converting an existing Pandas DataFrame column to the categorical data type using the astype() method.

    import pandas as pd

    # Create a DataFrame
    df = pd.DataFrame({"Col_a": list("aeeioou"), "Col_b": range(7)})

    # Display the input DataFrame
    print('Input DataFrame:\n', df)
    print('\nVerify the Data type of each column:\n', df.dtypes)

    # Convert the data type of Col_a to categorical
    df['Col_a'] = df["Col_a"].astype("category")

    # Display the converted DataFrame
    print('\nConverted DataFrame:\n', df)
    print('\nVerify the Data type of each column:\n', df.dtypes)

Following is the output of the above code:
Input DataFrame:

| | Col_a | Col_b |
|---|---|---|
| 0 | a | 0 |
| 1 | e | 1 |
| 2 | e | 2 |
| 3 | i | 3 |
| 4 | o | 4 |
| 5 | o | 5 |
| 6 | u | 6 |

Verify the Data type of each column:

    Col_a    object
    Col_b     int64
    dtype: object

Converted DataFrame:

| | Col_a | Col_b |
|---|---|---|
| 0 | a | 0 |
| 1 | e | 1 |
| 2 | e | 2 |
| 3 | i | 3 |
| 4 | o | 4 |
| 5 | o | 5 |
| 6 | u | 6 |

Verify the Data type of each column:

    Col_a    category
    Col_b       int64
    dtype: object
Controlling Behavior of the Categorical Data
By default, Pandas infers categories from the data and treats them as unordered. To control the behavior, you can use the CategoricalDtype class from the pandas.api.types module.
Example
This example demonstrates how to apply the CategoricalDtype to a whole DataFrame.

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    # Create a DataFrame
    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    # Display the input DataFrame
    print('Input DataFrame:\n', df)
    print('\nVerify the Data type of each column:\n', df.dtypes)

    # Apply CategoricalDtype to the whole DataFrame
    cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
    df_cat = df.astype(cat_type)

    # Display the converted DataFrame
    print('\nConverted DataFrame:\n', df_cat)
    print('\nVerify the Data type of each column:\n', df_cat.dtypes)

Following is the output of the above code:
Input DataFrame:

| | A | B |
|---|---|---|
| 0 | a | b |
| 1 | b | c |
| 2 | c | c |
| 3 | a | d |

Verify the Data type of each column:

    A    object
    B    object
    dtype: object

Converted DataFrame:

| | A | B |
|---|---|---|
| 0 | a | b |
| 1 | b | c |
| 2 | c | c |
| 3 | a | d |

Verify the Data type of each column:

    A    category
    B    category
    dtype: object
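The printed dtypes do not show what actually changed, so the sketch below compares the default inferred categorical with one built from an explicit CategoricalDtype. It reuses the values of column "B" from the example; the names s_default and s_typed are chosen only for this illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Default behavior: categories are inferred from the data and unordered
s_default = pd.Series(list("bccd"), dtype="category")
print(s_default.cat.categories.tolist())  # ['b', 'c', 'd']
print(s_default.cat.ordered)              # False

# Explicit dtype: the category set and its order are fixed up front
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
s_typed = pd.Series(list("bccd"), dtype=cat_type)
print(s_typed.cat.categories.tolist())    # ['a', 'b', 'c', 'd']
print(s_typed.cat.ordered)                # True

# With ordered=True, order comparisons against a category are allowed
print((s_typed > "b").tolist())           # [False, True, True, True]
```

The explicit dtype keeps "a" as a category even though it never occurs in the data, which is useful when several columns should share one category set.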
Converting the Categorical Data Back to Original
After converting a Series to categorical data, you can convert it back to its original form using Series.astype() or np.asarray().
Example
This example converts the categorical data of a Series object back to the object data type using the astype() method.

    import pandas as pd

    # Create a Series object with categorical data
    s = pd.Series(["a", "b", "c", "a"], dtype="category")

    # Display the categorical Series
    print('Series with Categorical Data:\n', s)

    # Display the Series converted back to strings
    print('Converted Series back to original:\n ', s.astype(str))

Following is the output of the above code:

    Series with Categorical Data:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    Converted Series back to original:
    0    a
    1    b
    2    c
    3    a
    dtype: object
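The np.asarray() route mentioned above can be sketched the same way. It returns a plain NumPy array of the original values rather than a Series, so the index and the category information are dropped.

```python
import numpy as np
import pandas as pd

# Create a Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Convert to a plain NumPy array holding the original values
arr = np.asarray(s)

print(type(arr))
print(arr.tolist())  # ['a', 'b', 'c', 'a']
```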
Describing a Categorical Data Column
Calling the describe() method on categorical data produces output similar to that of a Series or DataFrame of string type.
Example
The following example demonstrates how to get the description of a Pandas categorical DataFrame using the describe() method.

    import pandas as pd
    import numpy as np

    # Categorical with one missing value and one unused category ("b")
    cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
    df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})

    print("Description for whole DataFrame:")
    print(df.describe())

    print("\nDescription only for a DataFrame column:")
    print(df["cat"].describe())

Its output is as follows:
Description for whole DataFrame:

| | cat | s |
|---|---|---|
| count | 3 | 3 |
| unique | 2 | 2 |
| top | c | c |
| freq | 2 | 2 |

Description only for a DataFrame column:

    count     3
    unique    2
    top       c
    freq      2
    Name: cat, dtype: object
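To see where these summary numbers come from, a small sketch using value_counts() on the same categorical column may help. It is an illustration added here, not part of the original example: the missing value is excluded from the counts, and the unused category "b" is listed with a count of zero.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat})

# Per-category counts; NaN is dropped by default
vc = df["cat"].value_counts()
print(vc)

# "c" is the most frequent value (top) and occurs twice (freq)
print(vc.idxmax(), vc.max())  # c 2
```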