DEV Community
Charles De Barros

Posted on • Edited on

Back to Basics - Pandas #01

Introduction to Pandas

Pandas Official Documentation.

What is Pandas?

As per the Pandas Official Documentation website:

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas simplifies common data analysis tasks:

  • Load data from numerous types of files and online sources

  • Fast and efficient handling of large amounts of data

  • Filtering, sorting, editing and processing of data

  • Joining and aggregation of datasets

  • Tools for time series and statistical analysis

  • Display of data in tables and charts

Pandas data structures: DataFrames & Series

DataFrame (rows and columns)

DataFrame is a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet, an SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series

  • 2-D numpy.ndarray

  • Structured or record ndarray

  • A Series

  • Another DataFrame

Notes on Data Frames:

  • A 2-dimensional labelled data structure made of rows and columns of potentially different types.

  • Works on a similar principle to a spreadsheet or SQL table, and is the most commonly used data structure in Pandas.

  • By default, row labels start from 0 (zero), and positional indexing (for example with .iloc) starts from 0 for both rows and columns.
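As a minimal sketch of the points above, the following builds a Data Frame from a dict of lists and reads values by position. The column names and values are hypothetical and are not taken from the dataset used later in this post.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
data = {
    "name": ["Ana", "Ben", "Cleo"],
    "age": [34, 28, 45],
    "active": [True, False, True],
}
df_demo = pd.DataFrame(data)

# Rows get a default index starting at 0; columns are addressed by label.
first_row = df_demo.iloc[0]       # first row, selected by position
first_cell = df_demo.iloc[0, 0]   # first row and first column, by position
ages = df_demo["age"]             # a single column comes back as a Series

print(df_demo)
```

Each column keeps its own type here: text, integers and booleans sit side by side in the same Data Frame.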

Series (one-dimensional data)

Series is a one-dimensional labelled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

  • It holds one-dimensional labelled data.

  • An example of a Series is one column from a Data Frame.

  • By default, the index of a Series starts from 0 (zero).

The basic method to create a Series is to call:

s = pd.Series(data, index=index)

Placing one Series next to another, as columns that share an index, gives a Data Frame.
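To make this concrete, here is a small sketch that creates two Series with the same index and combines them column-wise. The values, labels and names are hypothetical, and pd.concat is just one of several ways to combine Series.

```python
import pandas as pd

# Hypothetical values and labels for illustration.
labels = ["apple", "pear", "plum"]
prices = pd.Series([1.5, 2.0, 0.75], index=labels, name="price")
stock = pd.Series([10, 0, 25], index=labels, name="stock")

# Without an explicit index, pandas assigns 0, 1, 2, ...
default_index = pd.Series(["a", "b", "c"])

# Series that share an index can be placed side by side as columns.
fruit = pd.concat([prices, stock], axis=1)
print(fruit)
```

The name given to each Series becomes the column label in the resulting Data Frame.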

Pandas Common Data Types

Data type                 Description
object                    Used for strings, or if the column contains a mix of data types
int64                     Used for integers (the '64' is the number of bits used to store each value)
float64                   Used for floats, or where the column has both integers and NaN values
bool                      Booleans, i.e., True or False
datetime64 / timedelta    Time-based values
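A quick way to see these types in practice is to build a small Data Frame and inspect its .dtypes attribute. The sketch below uses hypothetical columns. Note that the exact label reported for text and datetime columns can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, one per common data type.
typed = pd.DataFrame({
    "label": ["a", "b", "c"],            # text
    "count": [1, 2, 3],                  # integers
    "count_with_gap": [1, np.nan, 3],    # integers plus NaN are stored as floats
    "flag": [True, False, True],         # booleans
    "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# One dtype per column.
print(typed.dtypes)
```

The "count_with_gap" column illustrates the float64 row of the table: a single NaN is enough to turn an integer column into floats.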

Pandas Missing Values

The NaN marker represents Pandas missing values or NULL values. In most cases, the terms missing and null are interchangeable. Date values use the NaT marker.

Symbol    Description
NaN       Used to indicate missing values in most instances and is supported by the float data type.
NaT       Used to indicate missing values where a datetime type object may have been expected.
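The sketch below shows both markers and how .isna() detects them. The Series contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry in each Series.
readings = pd.Series([1.0, np.nan, 3.0])
dates = pd.Series(pd.to_datetime(["2024-01-01", None]))

missing_readings = readings.isna()   # True where the value is NaN
missing_dates = dates.isna()         # True where the value is NaT
n_missing = int(readings.isna().sum())

print(readings)
print(dates)
```

Summing the boolean result of .isna() is a common way to count missing values in a column.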

Exploratory Data Analysis - EDA

What is EDA?

Exploratory data analysis (EDA) is used by data scientists to analyse and investigate data sets and summarise their main characteristics, often employing data visualisation methods.

EDA helps determine how best to manipulate data sources to get the answers needed, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modelling or hypothesis testing task and provides a better understanding of data set variables and their relationship. It can also help determine if the statistical techniques you are considering for data analysis are appropriate.

Basic Pandas Data Frame exploration

Importing Pandas (and NumPy)

The following lines will bring both the Pandas and NumPy libraries into the working environment.

import pandas as pd
import numpy as np

Data can be imported from various formats: CSV, spreadsheets, JSON, databases, etc.

The following line will import a CSV (Comma-Separated Values) file into the working session.

df = pd.read_csv(filepath)  # filepath: location of the CSV file

Note: different parameters can be used with the .read_csv() function. In its simplest form, only the file path is required.
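As a sketch of a few of those optional parameters, the example below reads from an in-memory string instead of a file so that it runs on its own. The CSV content and column names are hypothetical. sep, usecols and nrows are standard read_csv parameters.

```python
import io
import pandas as pd

# In-memory stand-in for a file on disk; the content is hypothetical.
csv_text = "id;plan;monthly\n1;basic;20.5\n2;plus;35.0\n3;basic;19.9\n"

subset = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                     # field delimiter (the default is ",")
    usecols=["id", "monthly"],   # load only these columns
    nrows=2,                     # read only the first 2 data rows
)
print(subset)
```

Limiting columns and rows like this can be handy for a first look at a large file before loading all of it.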

The "Telco-Customer-Churn.csv" dataset can be found on Kaggle.

# Will open the 'Telco-Customer-Churn.csv' in the 'data' folder
df = pd.read_csv("data/Telco-Customer-Churn.csv")

Checking Top/Bottom rows

We can use the .head() and .tail() functions to see the first 5 rows and the last 5 rows, in the order in which they appear in the Data Frame.

The .head() function returns the first n rows of the object based on position. It is useful for quickly testing if your object has the right type of data in it. The default is 5 rows.

The .tail() function returns the last n rows of the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows. The default is 5 rows.

# Returns the first 5 rows, in their existing order
df.head()

# Returns the last 5 rows, in their existing order
df.tail()
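Both functions accept the number of rows as an argument. A minimal sketch on a hypothetical ten-row Data Frame:

```python
import pandas as pd

# Hypothetical Data Frame with the values 0 to 9 in a single column.
numbers = pd.DataFrame({"n": range(10)})

top3 = numbers.head(3)      # first 3 rows
bottom2 = numbers.tail(2)   # last 2 rows

print(top3)
print(bottom2)
```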

Checking random sample rows

The .sample() function returns rows chosen at random. It offers an interesting look at the body of the Data Frame. The default is to return only 1 row.

# Returns 10 random row samples.
df.sample(10)
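Because the rows are random, two calls normally return different rows. The sketch below, on a hypothetical Data Frame, uses the random_state parameter to make the selection repeatable and the frac parameter to ask for a fraction of the rows instead of a fixed number.

```python
import pandas as pd

# Hypothetical Data Frame with 10 rows.
pool = pd.DataFrame({"n": range(10)})

# The same random_state gives the same rows on every call.
pick_a = pool.sample(4, random_state=42)
pick_b = pool.sample(4, random_state=42)

# frac=0.5 returns half of the rows (5 of 10 here).
half = pool.sample(frac=0.5, random_state=0)

print(pick_a)
```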

Showing the Data Frame Dimensions

The .shape attribute displays the Data Frame dimensions: rows and columns.

# In the 'Telco-Customer-Churn.csv' there are 7043 rows and 21 columns.
df.shape
# (7043, 21)
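Since .shape is a plain tuple and is read without parentheses, it can be unpacked directly. A short sketch on a hypothetical Data Frame:

```python
import pandas as pd

# Hypothetical Data Frame with 4 rows and 2 columns.
grid = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]})

n_rows, n_cols = grid.shape   # unpack (rows, columns)
print(f"{n_rows} rows and {n_cols} columns")
```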

Displaying the Data Frame basic info

The .info() function prints information about a DataFrame, including the index dtype, the columns and their dtypes, the non-null counts and the memory usage.

# Print a concise summary of a DataFrame.
df.info()

Returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
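The non-null counts in this summary can also be obtained as values to work with. The sketch below uses a hypothetical Data Frame with one missing entry per column. It captures the .info() text through its buf parameter and computes the same counts with .notna().sum().

```python
import io
import numpy as np
import pandas as pd

# Hypothetical Data Frame with one missing value in each column.
small = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# .info() prints by default; buf redirects the text into a buffer instead.
buffer = io.StringIO()
small.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Non-null count per column, as a Series rather than printed text.
non_null = small.notna().sum()
print(non_null)
```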

Generating Descriptive Statistics

The .describe() function generates descriptive statistics.

Descriptive statistics summarise the main features of a dataset. The statistics reported by .describe() cover the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

It analyzes both numeric and object Series, as well as Data Frame column sets of mixed data types. The output will vary depending on what is provided.

For numeric data, it returns the following:

  • count: Total number of non-missing values

  • mean: The mean value

  • std: The standard deviation

  • min: The minimum value

  • 25%: The value of the first quartile (25th percentile)

  • 50%: The median value (50th percentile)

  • 75%: The value of the third quartile (75th percentile)

  • max: The maximum value

df.describe()

Returns:

       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000
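By default only numeric columns are summarised, which is why just three of the 21 columns appear above. The sketch below, on a hypothetical Data Frame, passes include="all" so that text columns are described too. For those, the output adds the number of unique values, the most frequent value (top) and how often it occurs (freq).

```python
import pandas as pd

# Hypothetical Data Frame with one text and one numeric column.
plans = pd.DataFrame({
    "plan": ["basic", "plus", "basic", "basic"],
    "monthly": [20.0, 35.0, 20.0, 25.0],
})

numeric_stats = plans.describe()            # numeric columns only
all_stats = plans.describe(include="all")   # every column

print(numeric_stats)
print(all_stats)
```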

Sorting Values

The .sort_values() function sorts by the values along either axis. It allows us to choose a column to sort by. The default order is ascending (ASC).

# ascending=False will sort the column values in DESC order.
df.sort_values(by='MonthlyCharges', ascending=False)
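The by parameter also accepts a list of columns, with a matching list for ascending. The sketch below reuses two column names from the dataset with hypothetical values. Note that .sort_values() returns a new Data Frame and leaves the original untouched.

```python
import pandas as pd

# Column names follow the dataset above; the values are hypothetical.
bills = pd.DataFrame({
    "Contract": ["Monthly", "Yearly", "Monthly", "Yearly"],
    "MonthlyCharges": [70.0, 20.0, 90.0, 55.0],
})

# Single column, highest charge first.
by_charge = bills.sort_values(by="MonthlyCharges", ascending=False)

# Contract A to Z, then highest charge first within each contract.
by_both = bills.sort_values(
    by=["Contract", "MonthlyCharges"],
    ascending=[True, False],
)

print(by_charge)
print(by_both)
```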

Summary

As a high-level data manipulation and analysis library, Pandas is a powerful and versatile tool in any Data Analyst's arsenal. It is fast, easy to use, and comprehensive enough to cover most dataset analysis and manipulation tasks.

Pandas' range of functions and methods is vast. It is a must-have tool that data analysts should be acquainted with and use in their daily tasks.

