Python Pandas - CategoricalDtype

Pandas CategoricalDtype

In Pandas, CategoricalDtype is the data type for categorical data. It specifies the set of allowed categories and whether they are ordered. It can be used with Series, DataFrame columns, and any Pandas operation that accepts a dtype.

Using CategoricalDtype gives you explicit control over the categories and their order. Because each value is stored as a small integer code that points into the list of categories, it can also reduce memory usage and improve performance on large datasets with many repeated values. In this tutorial, we will learn about CategoricalDtype, its structure, and practical examples of its use.
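The memory claim can be checked directly. The following is a minimal sketch that compares the deep memory usage of the same Series stored as plain strings and as a categorical. The sample data in `values` is made up for illustration, and exact byte counts depend on the Pandas version and platform, so none are shown.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample data: three distinct labels repeated many times
values = ["low", "medium", "high"] * 10_000

# The same data stored with the default string dtype and as a categorical
plain = pd.Series(values)
categorical = plain.astype(CategoricalDtype(categories=["low", "medium", "high"]))

# deep=True also counts the memory held by the string values themselves
plain_bytes = plain.memory_usage(deep=True)
cat_bytes = categorical.memory_usage(deep=True)

print("Plain Series bytes:", plain_bytes)
print("Categorical Series bytes:", cat_bytes)
print("Categorical is smaller:", cat_bytes < plain_bytes)
```

The saving grows with the number of repeated values; for a column where almost every value is unique, a categorical may not help.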

CategoricalDtype Structure

A CategoricalDtype is fully described by −

categories: A sequence of unique values without missing entries.
ordered: A boolean indicating whether the categories have an inherent order.
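Both parts of the structure are exposed as attributes on the dtype object. The short sketch below, using an illustrative `size_type` name that is not part of the original tutorial, inspects them.

```python
from pandas.api.types import CategoricalDtype

# Illustrative dtype with three ordered categories
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# The categories are held in a Pandas Index
print(size_type.categories)

# The ordered flag is a plain boolean
print(size_type.ordered)
```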

Creating CategoricalDtype

You can create a CategoricalDtype using the pandas.api.types.CategoricalDtype class. This class defines a custom data type for categorical data, allowing you to control categories and their order explicitly.

Following is the syntax for creating the CategoricalDtype in Pandas −

from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=None, ordered=False)

Here,

categories: This parameter takes a sequence of unique, non-null values defining the valid categories. The values are stored in a Pandas Index, and if an Index is passed, its dtype is used. If it is left as None, the categories are inferred from the data when the dtype is applied.
ordered: This parameter takes a boolean value indicating whether the categories have an order. By default, it is set to False.
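The constraints on categories can be seen by trying a few inputs. The sketch below uses a hypothetical helper, `try_build`, written only for this illustration; it assumes that Pandas raises ValueError for duplicate or null categories.

```python
from pandas.api.types import CategoricalDtype


def try_build(categories):
    """Return the dtype, or a message if Pandas rejects the categories."""
    try:
        return CategoricalDtype(categories=categories)
    except ValueError as exc:
        return f"rejected: {exc}"


# Unique, non-null values are accepted
print(try_build(["a", "b", "c"]))

# Duplicate values are rejected
print(try_build(["a", "a", "b"]))

# Null values are rejected
print(try_build(["a", None]))

# With no arguments, categories is None and ordered is False
default_type = CategoricalDtype()
print(default_type.categories, default_type.ordered)
```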

Example: Applying CategoricalDtype to a Series

The following example demonstrates creating a Pandas Series object with a CategoricalDtype.

import pandas as pd
from pandas.api.types import CategoricalDtype

# Define custom CategoricalDtype
cat_type = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)

# Create a Series with a defined categorical type
s = pd.Series(["low", "high", "medium", "low"], dtype=cat_type)

# Display the Series
print("Categorical Series:")
print(s)

Following is the output of the above code −

Categorical Series:
0       low
1      high
2    medium
3       low
dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
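Setting ordered=True is what makes order-aware operations meaningful. As a brief sketch on the same Series, the minimum, maximum, sort order, and comparisons all follow the declared category order rather than alphabetical order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
s = pd.Series(["low", "high", "medium", "low"], dtype=cat_type)

# Minimum and maximum follow the category order low < medium < high
print(s.min(), s.max())

# Sorting also follows the category order, not the alphabet
print(s.sort_values().tolist())

# Comparison against a category label is allowed for ordered categoricals
print((s > "low").tolist())
```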

Example: Applying CategoricalDtype to a DataFrame

The following example shows how to apply a CategoricalDtype to a DataFrame column.

import pandas as pd
from pandas.api.types import CategoricalDtype

# Define custom CategoricalDtype
cat_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# Create a DataFrame
df = pd.DataFrame({"Size": ["large", "small", "medium", "large"]})

# Convert column to CategoricalDtype
df["Size"] = df["Size"].astype(cat_type)

# Display the converted column
print("DataFrame with Categorical Data:")
print(df['Size'])

When we run the above program, it produces the following result −

DataFrame with Categorical Data:
0     large
1     small
2    medium
3     large
Name: Size, dtype: category
Categories (3, object): ['small' < 'medium' < 'large']

Usage of CategoricalDtype in Pandas

A CategoricalDtype can be used wherever Pandas expects a dtype, such as −

pandas.read_csv()
DataFrame.astype()
pandas.Series() constructor
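Of the three entry points listed above, read_csv() is the only one not demonstrated elsewhere in this tutorial. The following is a minimal sketch that reads from an in-memory string instead of a file; the CSV content and the column names `item` and `size` are made up for illustration.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical CSV content, held in memory so that no file is needed
csv_text = "item,size\nshirt,large\nsock,small\nhat,medium\n"

size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# Map the column name to the dtype through the dtype argument
df = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

print(df["size"].dtype == size_type)
print(df["size"].cat.categories.tolist())
```

Declaring the dtype at read time avoids a separate astype() pass after loading.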

Example: Using CategoricalDtype with DataFrame.astype()

This example shows how to use a CategoricalDtype with the Pandas DataFrame.astype() method to specify the data type of a DataFrame column.

import pandas as pd
from pandas.api.types import CategoricalDtype

# Creating a DataFrame
data = {'col1': ["duck", "wolf", 'cat']}
df = pd.DataFrame(data)

# Convert column to CategoricalDtype
custom_dtype = CategoricalDtype(categories=["duck", "cat", "wolf"], ordered=True)
df['col1'] = df['col1'].astype(custom_dtype)

# Display the converted column
print("DataFrame with Categorical Data:")
print(df['col1'])

While executing the above code we get the following output −

DataFrame with Categorical Data:
0    duck
1    wolf
2     cat
Name: col1, dtype: category
Categories (3, object): ['duck' < 'cat' < 'wolf']
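One behavior worth knowing when converting with astype(): a value that is not among the declared categories does not raise an error but becomes a missing value. The sketch below adds a hypothetical fourth value, "fox", to show this.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# "fox" is deliberately absent from the declared categories
animals = pd.Series(["duck", "wolf", "cat", "fox"])
custom_dtype = CategoricalDtype(categories=["duck", "cat", "wolf"], ordered=True)

converted = animals.astype(custom_dtype)

# The unlisted value is converted to a missing value (NaN)
print(converted.isna().tolist())
```

Checking isna() after a conversion is a simple way to catch misspelled or unexpected labels.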

Example: Default String Representation

As a shortcut, you can pass the string 'category' as the dtype instead of a CategoricalDtype instance. This behaves like CategoricalDtype() with its defaults: the categories are unordered and are inferred from the data.

This example uses the shortcut 'category' for applying categorical data type to the Pandas Series object.

import pandas as pd

# Create a Series using the 'category' shortcut
s = pd.Series(["low", "high", "medium", "low"], dtype='category')

# Display the Series
print("Categorical Series:")
print(s)

Following is the output of the above code −

Categorical Series:
0       low
1      high
2    medium
3       low
dtype: category
Categories (3, object): ['high', 'low', 'medium']
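Note that the inferred categories in this output are sorted alphabetically and carry no order. If a meaningful order is needed later, one option, sketched below with an illustrative `level_type` name, is to convert the Series to an explicit ordered CategoricalDtype.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["low", "high", "medium", "low"], dtype="category")

# Inferred categories are sorted and unordered
print(s.cat.categories.tolist())
print(s.cat.ordered)

# Convert to an explicit ordered dtype to declare low < medium < high
level_type = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
s_ordered = s.astype(level_type)

print(s_ordered.cat.categories.tolist())
print(s_ordered.max())
```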

Comparing CategoricalDtype Instances

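Two instances of CategoricalDtype are equal if they have the same categories and the same ordered setting. When the categories are unordered, the sequence in which they are listed does not matter. Every CategoricalDtype instance also compares equal to the string 'category'.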

Example

This example compares ordered and unordered CategoricalDtype instances to show the equality semantics of the categorical data type object.

import pandas as pd
from pandas.api.types import CategoricalDtype

c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Unordered categories - listing order does not matter
result1 = (c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False))
print("Equality of two unordered same categories:", result1)

# Same categories, but one dtype is ordered and the other is not - unequal
result2 = (c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True))
print("Equality of ordered category with an unordered one:", result2)

# Comparison with the 'category' shortcut
print(c1 == 'category')