A DataFrame is the base component of Polars, and it is worth learning its basics as you begin your journey in Polars. A DataFrame is like a table with rows and columns. It's the fundamental structure that other Polars components are deeply interconnected with.
If you've used the pandas library before, you might be surprised to learn that Polars actually doesn't have a concept of an index. In pandas, an index is a series of labels that identify each row. It helps you select and align rows of your DataFrame. This is also different from the indexes you might see in SQL databases, in that a pandas index is not primarily intended to speed up data retrieval.
You might've found indexes in pandas useful, but I bet that they also gave you some headaches. Polars avoids the complexity that comes with an index. If you'd like to learn more about the conceptual differences between pandas and Polars, you can look at this page in the Polars documentation: https://pola-rs.github.io/polars/user-guide/migration/pandas.
In this recipe, we'll cover some ways to create a Polars DataFrame, as well as useful methods for extracting DataFrame attributes.
Getting ready
We'll use a dataset stored in this GitHub repo: https://github.com/PacktPublishing/Polars-Cookbook/blob/main/data/titanic_dataset.csv. Also, make sure that you import the Polars library at the beginning of your code:
import polars as pl
How to do it...
We'll start by creating a DataFrame and exploring its attributes:
- Create a DataFrame from scratch with a Python dictionary as the input:
df = pl.DataFrame({
    'nums': [1,2,3,4,5],
    'letters': ['a','b','c','d','e']
})
df.head()
The preceding code will return the following output:
Figure 1.3 – The output of an example DataFrame
- Create a DataFrame by reading a .csv file. Then take a peek at the dataset:
df = pl.read_csv('../data/titanic_dataset.csv')
df.head()
The preceding code will return the following output:
Figure 1.4 – The first few rows of the titanic dataset
- Explore DataFrame attributes. .schema gives you the combination of each column name and data type as a dictionary-like mapping. You can get the column names and data types as separate lists with .columns and .dtypes:
df.schema
The preceding code will return thefollowing output:
>> Schema([('PassengerId', Int64), ('Survived', Int64), ('Pclass', Int64), ('Name', String), ('Sex', String), ('Age', Float64), ('SibSp', Int64), ('Parch', Int64), ('Ticket', String), ('Fare', Float64), ('Cabin', String), ('Embarked', String)])
df.columns
The preceding code will return the following output:
>> ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
df.dtypes
The preceding code will return thefollowing output:
>> [Int64, Int64, Int64, String, String, Float64, Int64, Int64, String, Float64, String, String]
You can get the height and width of your DataFrame with .shape. You can also get the height and width individually with .height and .width:
df.shape
The preceding code will return thefollowing output:
>> (891, 12)
df.height
The preceding code will return thefollowing output:
>> 891
df.width
The preceding code will return thefollowing output:
>> 12
The .flags attribute shows per-column flags, such as whether a column is known to be sorted:
df.flags
The preceding code will return thefollowing output:
>> {'PassengerId': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Survived': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Pclass': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Name': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Sex': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Age': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'SibSp': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Parch': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Ticket': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Cabin': {'SORTED_ASC': False, 'SORTED_DESC': False}, 'Embarked': {'SORTED_ASC': False, 'SORTED_DESC': False}}
How it works...
Within pl.DataFrame(), I have added a Python dictionary as the data source. Its keys are strings (the column names), and its values are lists (the column values). Data types are auto-inferred unless you specify the schema.
The .head() method is handy in your analysis workflow. It shows the first n rows, where n is the number of rows you specify. The default value of n is set to 5.
pl.read_csv() is one of the common ways to read data into a DataFrame. It involves specifying the path of the file you want to read. It has many parameters that help you load data efficiently, tailored to your use case. We'll cover the topic of reading and writing files in detail in the next chapter.
There’s more...
The Polars DataFrame can take many forms of data as its source, such as Python dictionaries, Polars Series, NumPy arrays, pandas DataFrames, and so on. You can even utilize functions like pl.from_numpy() and pl.from_pandas() to import data directly from other structures instead of using pl.DataFrame().
Also, there are several parameters you can set when creating a DataFrame, including the schema. You can preset the schema of your dataset; otherwise, it will be auto-inferred by Polars's engine:
import numpy as np
numpy_arr = np.array([[1,1,1], [2,2,2]])
df = pl.from_numpy(numpy_arr, schema={'ones': pl.Float32, 'twos': pl.Int8}, orient='col')
df.head()
The preceding code will return the following output:
Figure 1.5 – A DataFrame created from a NumPy array
Both reading into a DataFrame and outputting to other structures such as a pandas DataFrame or a pyarrow.Table are possible. We'll cover that in Chapter 10, Interoperability with Other Python Libraries.
You can broadly categorize the data types in Polars into five categories:
- Numeric
- String/categorical
- Date/time
- Nested
- Other (Boolean, Binary, and so forth)
We'll look at working with specific types of data throughout this book, but it's good to know what data types exist early on in the journey of learning about Polars.
You can see a complete list of data types on this Polars documentation page: https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html.
See also
Please refer to each section of the Polars documentation for additional information: