
- Python Pandas - Home
- Python Pandas - Introduction
- Python Pandas - Environment Setup
- Python Pandas - Basics
- Python Pandas - Introduction to Data Structures
- Python Pandas - Index Objects
- Python Pandas - Panel
- Python Pandas - Basic Functionality
- Python Pandas - Indexing & Selecting Data
- Python Pandas - Series
- Python Pandas - Series
- Python Pandas - Slicing a Series Object
- Python Pandas - Attributes of a Series Object
- Python Pandas - Arithmetic Operations on Series Object
- Python Pandas - Converting Series to Other Objects
- Python Pandas - DataFrame
- Python Pandas - DataFrame
- Python Pandas - Accessing DataFrame
- Python Pandas - Slicing a DataFrame Object
- Python Pandas - Modifying DataFrame
- Python Pandas - Removing Rows from a DataFrame
- Python Pandas - Arithmetic Operations on DataFrame
- Python Pandas - IO Tools
- Python Pandas - IO Tools
- Python Pandas - Working with CSV Format
- Python Pandas - Reading & Writing JSON Files
- Python Pandas - Reading Data from an Excel File
- Python Pandas - Writing Data to Excel Files
- Python Pandas - Working with HTML Data
- Python Pandas - Clipboard
- Python Pandas - Working with HDF5 Format
- Python Pandas - Comparison with SQL
- Python Pandas - Data Handling
- Python Pandas - Sorting
- Python Pandas - Reindexing
- Python Pandas - Iteration
- Python Pandas - Concatenation
- Python Pandas - Statistical Functions
- Python Pandas - Descriptive Statistics
- Python Pandas - Working with Text Data
- Python Pandas - Function Application
- Python Pandas - Options & Customization
- Python Pandas - Window Functions
- Python Pandas - Aggregations
- Python Pandas - Merging/Joining
- Python Pandas - MultiIndex
- Python Pandas - Basics of MultiIndex
- Python Pandas - Indexing with MultiIndex
- Python Pandas - Advanced Reindexing with MultiIndex
- Python Pandas - Renaming MultiIndex Labels
- Python Pandas - Sorting a MultiIndex
- Python Pandas - Binary Operations
- Python Pandas - Binary Comparison Operations
- Python Pandas - Boolean Indexing
- Python Pandas - Boolean Masking
- Python Pandas - Data Reshaping & Pivoting
- Python Pandas - Pivoting
- Python Pandas - Stacking & Unstacking
- Python Pandas - Melting
- Python Pandas - Computing Dummy Variables
- Python Pandas - Categorical Data
- Python Pandas - Categorical Data
- Python Pandas - Ordering & Sorting Categorical Data
- Python Pandas - Comparing Categorical Data
- Python Pandas - Handling Missing Data
- Python Pandas - Missing Data
- Python Pandas - Filling Missing Data
- Python Pandas - Interpolation of Missing Values
- Python Pandas - Dropping Missing Data
- Python Pandas - Calculations with Missing Data
- Python Pandas - Handling Duplicates
- Python Pandas - Duplicated Data
- Python Pandas - Counting & Retrieving Unique Elements
- Python Pandas - Duplicated Labels
- Python Pandas - Grouping & Aggregation
- Python Pandas - GroupBy
- Python Pandas - Time-series Data
- Python Pandas - Date Functionality
- Python Pandas - Timedelta
- Python Pandas - Sparse Data Structures
- Python Pandas - Sparse Data
- Python Pandas - Visualization
- Python Pandas - Visualization
- Python Pandas - Additional Concepts
- Python Pandas - Caveats & Gotchas
Python Pandas - Merging/Joining
Pandas provides high-performance, in-memory join operations similar to those in SQL databases. These operations allow you to merge multipleDataFrame objects based on common keys or indexes efficiently.
The merge() Method in Pandas
TheDataFrame.merge() method in Pandas enables merging of DataFrame or named Series objects using database-style joins. A named Series is treated as a DataFrame with a single named column. Joins can be performed on columns or indexes.
If merging on columns, DataFrame indexes are ignored. If merging on indexes or indexes with columns, then the index will remains the same. However, in cross merges (how='cross'), you cannot specify column names for merging.
Below is the syntax of this method −
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)
The key parameters are −
right: A DataFrame or a named Series to merge with.
on: Columns (names) to join on. Must be found in both the DataFrame objects.
left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
left_index: IfTrue, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
right_index: Same usage asleft_index for the right DataFrame.
how: Determines type of join operation, available options are 'left', 'right', 'outer', 'inner', and 'cross'. Defaults to 'inner'. Each method has been described below.
sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults toTrue, setting toFalse will improve the performance substantially in many cases.
Example
Let's create two DataFrames and perform merge operations on them.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})print("Left DataFrame:")print(left)print("\nRight DataFrame:")print(right)Itsoutput is as follows −
Left DataFrame:
| id | Name | subject_id | |
|---|---|---|---|
| 0 | 1 | Alex | sub1 |
| 1 | 2 | Amy | sub2 |
| 2 | 3 | Allen | sub4 |
| 3 | 4 | Alice | sub6 |
| 4 | 5 | Ayoung | sub5 |
| id | Name | subject_id | |
|---|---|---|---|
| 0 | 1 | Billy | sub2 |
| 1 | 2 | Brian | sub4 |
| 2 | 3 | Bran | sub3 |
| 3 | 4 | Bryce | sub6 |
| 4 | 5 | Betty | sub5 |
Merge Two DataFrames on a Key
You can merge two DataFrames using a common key column by specifying the column name in theon parameter of themerge() method.
Example
The following example demonstrates how to merge two DataFrames on a key using theDataFrame.merge() method.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})# Merging DataFrames on a key 'id'result = left.merge(right, on='id')print(result)Itsoutput is as follows −
| id | Name_x | subject_id_x | Name_y | subject_id_y | |
|---|---|---|---|---|---|
| 0 | 1 | Alex | sub1 | Billy | sub2 |
| 1 | 2 | Amy | sub2 | Brian | sub4 |
| 2 | 3 | Allen | sub4 | Bran | sub3 |
| 3 | 4 | Alice | sub6 | Bryce | sub6 |
| 4 | 5 | Ayoung | sub5 | Betty | sub5 |
Merge Two DataFrames on Multiple Keys
To merge two DataFrames on multiple keys, provide a list of column names to theon parameter.
Example
The following example demonstrates how to merge DataFrames on multiple keys using themerge() method.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id':[1,2,3,4,5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id':['sub2','sub4','sub3','sub6','sub5']})# Merging on multiple keys 'id' and 'subject_id'result = left.merge(right, on=['id', 'subject_id'])print(result)Itsoutput is as follows −
| id | Name_x | subject_id | Name_y | |
|---|---|---|---|---|
| 0 | 4 | Alice | sub6 | Bryce |
| 1 | 5 | Ayoung | sub5 | Betty |
Merge Using 'how' Argument
Thehow argument determines which keys to include in the resulting DataFrame. If a key combination does not appear in either the left or right DataFrame, the values in the joined table will beNaN.
Merge Methods and Their SQL Equivalents
The following table summarizes thehow options and their SQL equivalents −
| Merge Method | SQL Equivalent | Description |
|---|---|---|
| left | LEFT OUTER JOIN | Use keys from left object |
| right | RIGHT OUTER JOIN | Use keys from right object |
| outer | FULL OUTER JOIN | Union of keys from both DataFrames. |
| inner | INNER JOIN | Intersection of keys from both DataFrames. |
Example: Left Join
This example demonstrates merging the DataFrame by using theleft method.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})# Merging DataFrames using the left join method print(left.merge(right, on='subject_id', how='left'))Itsoutput is as follows −
| id_x | Name_x | subject_id | id_y | Name_y | |
|---|---|---|---|---|---|
| 0 | 1 | Alex | sub1 | NaN | NaN |
| 1 | 2 | Amy | sub2 | 1.0 | Billy |
| 2 | 3 | Allen | sub4 | 2.0 | Brian |
| 3 | 4 | Alice | sub6 | 4.0 | Bryce |
| 4 | 5 | Ayoung | sub5 | 5.0 | Betty |
Example: Right Join
This example performs the right join operation on two DataFrames using themerge() method by setting thehow='right'.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})# Merging DataFrames using the right join methodprint(left.merge(right, on='subject_id', how='right'))Itsoutput is as follows −
| id_x | Name_x | subject_id | id_y | Name_y | |
|---|---|---|---|---|---|
| 0 | 2.0 | Amy | sub2 | 1 | Billy |
| 1 | 3.0 | Allen | sub4 | 2 | Brian |
| 2 | NaN | NaN | sub3 | 3 | Bran |
| 3 | 4.0 | Alice | sub6 | 4 | Bryce |
| 4 | 5.0 | Ayoung | sub5 | 5 | Betty |
Example: Outer Join
This example will apply the outer join operation on two DataFrames by specifying thehow='outer' in themerge() method.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})# Merging the DataFrames using the outer join print(left.merge(right, how='outer', on='subject_id'))Itsoutput is as follows −
| id_x | Name_x | subject_id | id_y | Name_y | |
|---|---|---|---|---|---|
| 0 | 1.0 | Alex | sub1 | NaN | NaN |
| 1 | 2.0 | Amy | sub2 | 1.0 | Billy |
| 2 | 3.0 | Allen | sub4 | 2.0 | Brian |
| 3 | 4.0 | Alice | sub6 | 4.0 | Bryce |
| 4 | 5.0 | Ayoung | sub5 | 5.0 | Betty |
| 5 | NaN | NaN | sub3 | 3.0 | Bran |
Inner Join
Joining will be performed on index. Join operation honors the object on which it is called.So,a.join(b) is not equal tob.join(a).
Example
The following example demonstrates how to apply inner join operation to the two DataFrames using theDataFrame.merge() method.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})# Merge the DataFrames using the inner join methodprint(left.merge(right, on='subject_id', how='inner'))Itsoutput is as follows −
| id_x | Name_x | subject_id | id_y | Name_y | |
|---|---|---|---|---|---|
| 0 | 2 | Amy | sub2 | 1 | Billy |
| 1 | 3 | Allen | sub4 | 2 | Brian |
| 2 | 4 | Alice | sub6 | 4 | Bryce |
| 3 | 5 | Ayoung | sub5 | 5 | Betty |
The join() Method in Pandas
Pandas also provides aDataFrame.join() method, which is useful for merging DataFrames based on their index. It works similarly toDataFrame.merge() but is more efficient for index-based operations.
Below is the syntax of this method −
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='')
Example
This example demonstrates how to use theDataFrame.join() method for merging DataFrames using indexes instead of columns.
import pandas as pd# Creating the first DataFrameleft = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})# Creating the second DataFrameright = pd.DataFrame({'id': [1, 2, 3, 4, 5],'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})# Merge the DataFrames using the join() methodresult = left.join(right, lsuffix='_left', rsuffix='_right')print(result)Itsoutput is as follows −
| id_left | Name_left | subject_id_left | id_right | Name_right | subject_id_right | |
|---|---|---|---|---|---|---|
| 0 | 1 | Alex | sub1 | 1 | Billy | sub2 |
| 1 | 2 | Amy | sub2 | 2 | Brian | sub4 |
| 2 | 3 | Allen | sub4 | 3 | Bran | sub3 |
| 3 | 4 | Alice | sub6 | 4 | Bryce | sub6 |
| 4 | 5 | Ayoung | sub5 | 5 | Betty | sub5 |