- User Guide
- Merge,...
Merge, join, concatenate and compare#
pandas provides various methods for combining and comparingSeries
orDataFrame
.
concat()
: Merge multipleSeries
orDataFrame
objects along a shared index or columnDataFrame.join()
: Merge multipleDataFrame
objects along the columnsDataFrame.combine_first()
: Update missing values with non-missing values in the same locationmerge()
: Combine twoSeries
orDataFrame
objects with SQL-style joiningmerge_ordered()
: Combine twoSeries
orDataFrame
objects along an ordered axismerge_asof()
: Combine twoSeries
orDataFrame
objects by near instead of exact matching keysSeries.compare()
andDataFrame.compare()
: Show differences in values between twoSeries
orDataFrame
objects
concat()
#
Theconcat()
function concatenates an arbitrary amount ofSeries
orDataFrame
objects along an axis whileperforming optional set logic (union or intersection) of the indexes onthe other axes. Likenumpy.concatenate
,concat()
takes a list or dict of homogeneously-typed objects and concatenates them.
In [1]:df1=pd.DataFrame( ...:{ ...:"A":["A0","A1","A2","A3"], ...:"B":["B0","B1","B2","B3"], ...:"C":["C0","C1","C2","C3"], ...:"D":["D0","D1","D2","D3"], ...:}, ...:index=[0,1,2,3], ...:) ...:In [2]:df2=pd.DataFrame( ...:{ ...:"A":["A4","A5","A6","A7"], ...:"B":["B4","B5","B6","B7"], ...:"C":["C4","C5","C6","C7"], ...:"D":["D4","D5","D6","D7"], ...:}, ...:index=[4,5,6,7], ...:) ...:In [3]:df3=pd.DataFrame( ...:{ ...:"A":["A8","A9","A10","A11"], ...:"B":["B8","B9","B10","B11"], ...:"C":["C8","C9","C10","C11"], ...:"D":["D8","D9","D10","D11"], ...:}, ...:index=[8,9,10,11], ...:) ...:In [4]:frames=[df1,df2,df3]In [5]:result=pd.concat(frames)In [6]:resultOut[6]: A B C D0 A0 B0 C0 D01 A1 B1 C1 D12 A2 B2 C2 D23 A3 B3 C3 D34 A4 B4 C4 D45 A5 B5 C5 D56 A6 B6 C6 D67 A7 B7 C7 D78 A8 B8 C8 D89 A9 B9 C9 D910 A10 B10 C10 D1011 A11 B11 C11 D11

Note
concat()
makes a full copy of the data, and iterativelyreusingconcat()
can create unnecessary copies. Collect allDataFrame
orSeries
objects in a list before usingconcat()
.
frames=[process_your_file(f)forfinfiles]result=pd.concat(frames)
Note
When concatenatingDataFrame
with named axes, pandas will attempt to preservethese index/column names whenever possible. In the case where all inputs share acommon name, this name will be assigned to the result. When the input names donot all agree, the result will be unnamed. The same is true forMultiIndex
,but the logic is applied separately on a level-by-level basis.
Joining logic of the resulting axis#
Thejoin
keyword specifies how to handle axis values that don’t exist in the firstDataFrame
.
join='outer'
takes the union of all axis values
In [7]:df4=pd.DataFrame( ...:{ ...:"B":["B2","B3","B6","B7"], ...:"D":["D2","D3","D6","D7"], ...:"F":["F2","F3","F6","F7"], ...:}, ...:index=[2,3,6,7], ...:) ...:In [8]:result=pd.concat([df1,df4],axis=1)In [9]:resultOut[9]: A B C D B D F0 A0 B0 C0 D0 NaN NaN NaN1 A1 B1 C1 D1 NaN NaN NaN2 A2 B2 C2 D2 B2 D2 F23 A3 B3 C3 D3 B3 D3 F36 NaN NaN NaN NaN B6 D6 F67 NaN NaN NaN NaN B7 D7 F7

join='inner'
takes the intersection of the axis values
In [10]:result=pd.concat([df1,df4],axis=1,join="inner")In [11]:resultOut[11]: A B C D B D F2 A2 B2 C2 D2 B2 D2 F23 A3 B3 C3 D3 B3 D3 F3

To perform an effective “left” join using theexact index from the originalDataFrame
, result can be reindexed.
In [12]:result=pd.concat([df1,df4],axis=1).reindex(df1.index)In [13]:resultOut[13]: A B C D B D F0 A0 B0 C0 D0 NaN NaN NaN1 A1 B1 C1 D1 NaN NaN NaN2 A2 B2 C2 D2 B2 D2 F23 A3 B3 C3 D3 B3 D3 F3

Ignoring indexes on the concatenation axis#
ForDataFrame
objects which don’t have a meaningful index, theignore_index
ignores overlapping indexes.
In [14]:result=pd.concat([df1,df4],ignore_index=True,sort=False)In [15]:resultOut[15]: A B C D F0 A0 B0 C0 D0 NaN1 A1 B1 C1 D1 NaN2 A2 B2 C2 D2 NaN3 A3 B3 C3 D3 NaN4 NaN B2 NaN D2 F25 NaN B3 NaN D3 F36 NaN B6 NaN D6 F67 NaN B7 NaN D7 F7

ConcatenatingSeries
andDataFrame
together#
You can concatenate a mix ofSeries
andDataFrame
objects. TheSeries
will be transformed toDataFrame
with the column name asthe name of theSeries
.
In [16]:s1=pd.Series(["X0","X1","X2","X3"],name="X")In [17]:result=pd.concat([df1,s1],axis=1)In [18]:resultOut[18]: A B C D X0 A0 B0 C0 D0 X01 A1 B1 C1 D1 X12 A2 B2 C2 D2 X23 A3 B3 C3 D3 X3

UnnamedSeries
will be numbered consecutively.
In [19]:s2=pd.Series(["_0","_1","_2","_3"])In [20]:result=pd.concat([df1,s2,s2,s2],axis=1)In [21]:resultOut[21]: A B C D 0 1 20 A0 B0 C0 D0 _0 _0 _01 A1 B1 C1 D1 _1 _1 _12 A2 B2 C2 D2 _2 _2 _23 A3 B3 C3 D3 _3 _3 _3

ignore_index=True
will drop all name references.
In [22]:result=pd.concat([df1,s1],axis=1,ignore_index=True)In [23]:resultOut[23]: 0 1 2 3 40 A0 B0 C0 D0 X01 A1 B1 C1 D1 X12 A2 B2 C2 D2 X23 A3 B3 C3 D3 X3

Resultingkeys
#
Thekeys
argument adds another axis level to the resulting index or column (creatingaMultiIndex
) associate specific keys with each originalDataFrame
.
In [24]:result=pd.concat(frames,keys=["x","y","z"])In [25]:resultOut[25]: A B C Dx 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7z 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11In [26]:result.loc["y"]Out[26]: A B C D4 A4 B4 C4 D45 A5 B5 C5 D56 A6 B6 C6 D67 A7 B7 C7 D7

Thekeys
argument cane override the column nameswhen creating a newDataFrame
based on existingSeries
.
In [27]:s3=pd.Series([0,1,2,3],name="foo")In [28]:s4=pd.Series([0,1,2,3])In [29]:s5=pd.Series([0,1,4,5])In [30]:pd.concat([s3,s4,s5],axis=1)Out[30]: foo 0 10 0 0 01 1 1 12 2 2 43 3 3 5In [31]:pd.concat([s3,s4,s5],axis=1,keys=["red","blue","yellow"])Out[31]: red blue yellow0 0 0 01 1 1 12 2 2 43 3 3 5
You can also pass a dict toconcat()
in which case the dict keys will be usedfor thekeys
argument unless otherkeys
argument is specified:
In [32]:pieces={"x":df1,"y":df2,"z":df3}In [33]:result=pd.concat(pieces)In [34]:resultOut[34]: A B C Dx 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7z 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11

In [35]:result=pd.concat(pieces,keys=["z","y"])In [36]:resultOut[36]: A B C Dz 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7

TheMultiIndex
created has levels that are constructed from the passed keys andthe index of theDataFrame
pieces:
In [37]:result.index.levelsOut[37]:FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])
levels
argument allows specifying resulting levels associated with thekeys
In [38]:result=pd.concat( ....:pieces,keys=["x","y","z"],levels=[["z","y","x","w"]],names=["group_key"] ....:) ....:In [39]:resultOut[39]: A B C Dgroup_keyx 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3y 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7z 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11

In [40]:result.index.levelsOut[40]:FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
Appending rows to aDataFrame
#
If you have aSeries
that you want to append as a single row to aDataFrame
, you can convert the row into aDataFrame
and useconcat()
In [41]:s2=pd.Series(["X0","X1","X2","X3"],index=["A","B","C","D"])In [42]:result=pd.concat([df1,s2.to_frame().T],ignore_index=True)In [43]:resultOut[43]: A B C D0 A0 B0 C0 D01 A1 B1 C1 D12 A2 B2 C2 D23 A3 B3 C3 D34 X0 X1 X2 X3

merge()
#
merge()
performs join operations similar to relational databases like SQL.Users who are familiar with SQL but new to pandas can reference acomparison with SQL.
Merge types#
merge()
implements common SQL style joining operations.
one-to-one: joining two
DataFrame
objects ontheir indexes which must contain unique values.many-to-one: joining a unique index to one ormore columns in a different
DataFrame
.many-to-many : joining columns on columns.
Note
When joining columns on columns, potentially a many-to-many join, anyindexes on the passedDataFrame
objectswill be discarded.
For amany-to-many join, if a key combination appearsmore than once in both tables, theDataFrame
will have theCartesianproduct of the associated data.
In [44]:left=pd.DataFrame( ....:{ ....:"key":["K0","K1","K2","K3"], ....:"A":["A0","A1","A2","A3"], ....:"B":["B0","B1","B2","B3"], ....:} ....:) ....:In [45]:right=pd.DataFrame( ....:{ ....:"key":["K0","K1","K2","K3"], ....:"C":["C0","C1","C2","C3"], ....:"D":["D0","D1","D2","D3"], ....:} ....:) ....:In [46]:result=pd.merge(left,right,on="key")In [47]:resultOut[47]: key A B C D0 K0 A0 B0 C0 D01 K1 A1 B1 C1 D12 K2 A2 B2 C2 D23 K3 A3 B3 C3 D3

Thehow
argument tomerge()
specifies which keys areincluded in the resulting table. If a key combinationdoes not appear ineither the left or right tables, the values in the joined table will beNA
. Here is a summary of thehow
options and their SQL equivalent names:
Merge method | SQL Join Name | Description |
---|---|---|
|
| Use keys from left frame only |
|
| Use keys from right frame only |
|
| Use union of keys from both frames |
|
| Use intersection of keys from both frames |
|
| Create the cartesian product of rows of both frames |
In [48]:left=pd.DataFrame( ....:{ ....:"key1":["K0","K0","K1","K2"], ....:"key2":["K0","K1","K0","K1"], ....:"A":["A0","A1","A2","A3"], ....:"B":["B0","B1","B2","B3"], ....:} ....:) ....:In [49]:right=pd.DataFrame( ....:{ ....:"key1":["K0","K1","K1","K2"], ....:"key2":["K0","K0","K0","K0"], ....:"C":["C0","C1","C2","C3"], ....:"D":["D0","D1","D2","D3"], ....:} ....:) ....:In [50]:result=pd.merge(left,right,how="left",on=["key1","key2"])In [51]:resultOut[51]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K0 K1 A1 B1 NaN NaN2 K1 K0 A2 B2 C1 D13 K1 K0 A2 B2 C2 D24 K2 K1 A3 B3 NaN NaN

In [52]:result=pd.merge(left,right,how="right",on=["key1","key2"])In [53]:resultOut[53]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K1 K0 A2 B2 C1 D12 K1 K0 A2 B2 C2 D23 K2 K0 NaN NaN C3 D3

In [54]:result=pd.merge(left,right,how="outer",on=["key1","key2"])In [55]:resultOut[55]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K0 K1 A1 B1 NaN NaN2 K1 K0 A2 B2 C1 D13 K1 K0 A2 B2 C2 D24 K2 K0 NaN NaN C3 D35 K2 K1 A3 B3 NaN NaN

In [56]:result=pd.merge(left,right,how="inner",on=["key1","key2"])In [57]:resultOut[57]: key1 key2 A B C D0 K0 K0 A0 B0 C0 D01 K1 K0 A2 B2 C1 D12 K1 K0 A2 B2 C2 D2

In [58]:result=pd.merge(left,right,how="cross")In [59]:resultOut[59]: key1_x key2_x A B key1_y key2_y C D0 K0 K0 A0 B0 K0 K0 C0 D01 K0 K0 A0 B0 K1 K0 C1 D12 K0 K0 A0 B0 K1 K0 C2 D23 K0 K0 A0 B0 K2 K0 C3 D34 K0 K1 A1 B1 K0 K0 C0 D0.. ... ... .. .. ... ... .. ..11 K1 K0 A2 B2 K2 K0 C3 D312 K2 K1 A3 B3 K0 K0 C0 D013 K2 K1 A3 B3 K1 K0 C1 D114 K2 K1 A3 B3 K1 K0 C2 D215 K2 K1 A3 B3 K2 K0 C3 D3[16 rows x 8 columns]

You canSeries
and aDataFrame
with aMultiIndex
if the names oftheMultiIndex
correspond to the columns from theDataFrame
. TransformtheSeries
to aDataFrame
usingSeries.reset_index()
before merging
In [60]:df=pd.DataFrame({"Let":["A","B","C"],"Num":[1,2,3]})In [61]:dfOut[61]: Let Num0 A 11 B 22 C 3In [62]:ser=pd.Series( ....:["a","b","c","d","e","f"], ....:index=pd.MultiIndex.from_arrays( ....:[["A","B","C"]*2,[1,2,3,4,5,6]],names=["Let","Num"] ....:), ....:) ....:In [63]:serOut[63]:Let NumA 1 aB 2 bC 3 cA 4 dB 5 eC 6 fdtype: objectIn [64]:pd.merge(df,ser.reset_index(),on=["Let","Num"])Out[64]: Let Num 00 A 1 a1 B 2 b2 C 3 c
Performing an outer join with duplicate join keys inDataFrame
In [65]:left=pd.DataFrame({"A":[1,2],"B":[2,2]})In [66]:right=pd.DataFrame({"A":[4,5,6],"B":[2,2,2]})In [67]:result=pd.merge(left,right,on="B",how="outer")In [68]:resultOut[68]: A_x B A_y0 1 2 41 1 2 52 1 2 63 2 2 44 2 2 55 2 2 6

Warning
Merging on duplicate keys significantly increase the dimensions of the resultand can cause a memory overflow.
Merge key uniqueness#
Thevalidate
argument checks whether the uniqueness of merge keys.Key uniqueness is checked before merge operations and can protect against memory overflowsand unexpected key duplication.
In [69]:left=pd.DataFrame({"A":[1,2],"B":[1,2]})In [70]:right=pd.DataFrame({"A":[4,5,6],"B":[2,2,2]})In [71]:result=pd.merge(left,right,on="B",how="outer",validate="one_to_one")---------------------------------------------------------------------------MergeErrorTraceback (most recent call last)CellIn[71],line1---->1result=pd.merge(left,right,on="B",how="outer",validate="one_to_one")File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, inmerge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)155return_cross_merge(156left_df,157right_df,(...)167copy=copy,168)169else:-->170op=_MergeOperation(171left_df,172right_df,173how=how,174on=on,175left_on=left_on,176right_on=right_on,177left_index=left_index,178right_index=right_index,179sort=sort,180suffixes=suffixes,181indicator=indicator,182validate=validate,183)184returnop.get_result(copy=copy)File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in_MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)809# If argument passed to validate,810# check if columns specified as unique811# are in fact unique.812ifvalidateisnotNone:-->813self._validate_validate_kwd(validate)File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in_MergeOperation._validate_validate_kwd(self, validate)1653raiseMergeError(1654"Merge keys are not unique in left dataset; not a one-to-one merge"1655)1656ifnotright_unique:->1657raiseMergeError(1658"Merge keys are not unique in right dataset; not a one-to-one merge"1659)1661elifvalidatein["one_to_many","1:m"]:1662ifnotleft_unique:MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
If the user is aware of the duplicates in the rightDataFrame
but wants toensure there are no duplicates in the leftDataFrame
, one can use thevalidate='one_to_many'
argument instead, which will not raise an exception.
In [72]:pd.merge(left,right,on="B",how="outer",validate="one_to_many")Out[72]: A_x B A_y0 1 1 NaN1 2 2 4.02 2 2 5.03 2 2 6.0
Merge result indicator#
merge()
accepts the argumentindicator
. IfTrue
, aCategorical-type column called_merge
will be added to the output objectthat takes on values:
Observation Origin
_merge
valueMerge key only in
'left'
frame
left_only
Merge key only in
'right'
frame
right_only
Merge key in both frames
both
In [73]:df1=pd.DataFrame({"col1":[0,1],"col_left":["a","b"]})In [74]:df2=pd.DataFrame({"col1":[1,2,2],"col_right":[2,2,2]})In [75]:pd.merge(df1,df2,on="col1",how="outer",indicator=True)Out[75]: col1 col_left col_right _merge0 0 a NaN left_only1 1 b 2.0 both2 2 NaN 2.0 right_only3 2 NaN 2.0 right_only
A string argument toindicator
will use the value as the name for the indicator column.
In [76]:pd.merge(df1,df2,on="col1",how="outer",indicator="indicator_column")Out[76]: col1 col_left col_right indicator_column0 0 a NaN left_only1 1 b 2.0 both2 2 NaN 2.0 right_only3 2 NaN 2.0 right_only
Overlapping value columns#
The mergesuffixes
argument takes a tuple of list of strings to append tooverlapping column names in the inputDataFrame
to disambiguate the resultcolumns:
In [77]:left=pd.DataFrame({"k":["K0","K1","K2"],"v":[1,2,3]})In [78]:right=pd.DataFrame({"k":["K0","K0","K3"],"v":[4,5,6]})In [79]:result=pd.merge(left,right,on="k")In [80]:resultOut[80]: k v_x v_y0 K0 1 41 K0 1 5

In [81]:result=pd.merge(left,right,on="k",suffixes=("_l","_r"))In [82]:resultOut[82]: k v_l v_r0 K0 1 41 K0 1 5

DataFrame.join()
#
DataFrame.join()
combines the columns of multiple,potentially differently-indexedDataFrame
into a single resultDataFrame
.
In [83]:left=pd.DataFrame( ....:{"A":["A0","A1","A2"],"B":["B0","B1","B2"]},index=["K0","K1","K2"] ....:) ....:In [84]:right=pd.DataFrame( ....:{"C":["C0","C2","C3"],"D":["D0","D2","D3"]},index=["K0","K2","K3"] ....:) ....:In [85]:result=left.join(right)In [86]:resultOut[86]: A B C DK0 A0 B0 C0 D0K1 A1 B1 NaN NaNK2 A2 B2 C2 D2

In [87]:result=left.join(right,how="outer")In [88]:resultOut[88]: A B C DK0 A0 B0 C0 D0K1 A1 B1 NaN NaNK2 A2 B2 C2 D2K3 NaN NaN C3 D3

In [89]:result=left.join(right,how="inner")In [90]:resultOut[90]: A B C DK0 A0 B0 C0 D0K2 A2 B2 C2 D2

DataFrame.join()
takes an optionalon
argument which may be a columnor multiple column names that the passedDataFrame
is to bealigned.
In [91]:left=pd.DataFrame( ....:{ ....:"A":["A0","A1","A2","A3"], ....:"B":["B0","B1","B2","B3"], ....:"key":["K0","K1","K0","K1"], ....:} ....:) ....:In [92]:right=pd.DataFrame({"C":["C0","C1"],"D":["D0","D1"]},index=["K0","K1"])In [93]:result=left.join(right,on="key")In [94]:resultOut[94]: A B key C D0 A0 B0 K0 C0 D01 A1 B1 K1 C1 D12 A2 B2 K0 C0 D03 A3 B3 K1 C1 D1

In [95]:result=pd.merge( ....:left,right,left_on="key",right_index=True,how="left",sort=False ....:) ....:In [96]:resultOut[96]: A B key C D0 A0 B0 K0 C0 D01 A1 B1 K1 C1 D12 A2 B2 K0 C0 D03 A3 B3 K1 C1 D1

To join on multiple keys, the passedDataFrame
must have aMultiIndex
:
In [97]:left=pd.DataFrame( ....:{ ....:"A":["A0","A1","A2","A3"], ....:"B":["B0","B1","B2","B3"], ....:"key1":["K0","K0","K1","K2"], ....:"key2":["K0","K1","K0","K1"], ....:} ....:) ....:In [98]:index=pd.MultiIndex.from_tuples( ....:[("K0","K0"),("K1","K0"),("K2","K0"),("K2","K1")] ....:) ....:In [99]:right=pd.DataFrame( ....:{"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]},index=index ....:) ....:In [100]:result=left.join(right,on=["key1","key2"])In [101]:resultOut[101]: A B key1 key2 C D0 A0 B0 K0 K0 C0 D01 A1 B1 K0 K1 NaN NaN2 A2 B2 K1 K0 C1 D13 A3 B3 K2 K1 C3 D3

The default forDataFrame.join
is to perform a left joinwhich uses only the keys found in thecallingDataFrame
. Other join types can be specified withhow
.
In [102]:result=left.join(right,on=["key1","key2"],how="inner")In [103]:resultOut[103]: A B key1 key2 C D0 A0 B0 K0 K0 C0 D02 A2 B2 K1 K0 C1 D13 A3 B3 K2 K1 C3 D3

Joining a single Index to a MultiIndex#
You can join aDataFrame
with aIndex
to aDataFrame
with aMultiIndex
on a level.Thename
of theIndex
with match the level name of theMultiIndex
.
In [104]:left=pd.DataFrame( .....:{"A":["A0","A1","A2"],"B":["B0","B1","B2"]}, .....:index=pd.Index(["K0","K1","K2"],name="key"), .....:) .....:In [105]:index=pd.MultiIndex.from_tuples( .....:[("K0","Y0"),("K1","Y1"),("K2","Y2"),("K2","Y3")], .....:names=["key","Y"], .....:) .....:In [106]:right=pd.DataFrame( .....:{"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]}, .....:index=index, .....:) .....:In [107]:result=left.join(right,how="inner")In [108]:resultOut[108]: A B C Dkey YK0 Y0 A0 B0 C0 D0K1 Y1 A1 B1 C1 D1K2 Y2 A2 B2 C2 D2 Y3 A2 B2 C3 D3

Joining with twoMultiIndex
#
TheMultiIndex
of the input argument must be completely usedin the join and is a subset of the indices in the left argument.
In [109]:leftindex=pd.MultiIndex.from_product( .....:[list("abc"),list("xy"),[1,2]],names=["abc","xy","num"] .....:) .....:In [110]:left=pd.DataFrame({"v1":range(12)},index=leftindex)In [111]:leftOut[111]: v1abc xy numa x 1 0 2 1 y 1 2 2 3b x 1 4 2 5 y 1 6 2 7c x 1 8 2 9 y 1 10 2 11In [112]:rightindex=pd.MultiIndex.from_product( .....:[list("abc"),list("xy")],names=["abc","xy"] .....:) .....:In [113]:right=pd.DataFrame({"v2":[100*iforiinrange(1,7)]},index=rightindex)In [114]:rightOut[114]: v2abc xya x 100 y 200b x 300 y 400c x 500 y 600In [115]:left.join(right,on=["abc","xy"],how="inner")Out[115]: v1 v2abc xy numa x 1 0 100 2 1 100 y 1 2 200 2 3 200b x 1 4 300 2 5 300 y 1 6 400 2 7 400c x 1 8 500 2 9 500 y 1 10 600 2 11 600
In [116]:leftindex=pd.MultiIndex.from_tuples( .....:[("K0","X0"),("K0","X1"),("K1","X2")],names=["key","X"] .....:) .....:In [117]:left=pd.DataFrame( .....:{"A":["A0","A1","A2"],"B":["B0","B1","B2"]},index=leftindex .....:) .....:In [118]:rightindex=pd.MultiIndex.from_tuples( .....:[("K0","Y0"),("K1","Y1"),("K2","Y2"),("K2","Y3")],names=["key","Y"] .....:) .....:In [119]:right=pd.DataFrame( .....:{"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]},index=rightindex .....:) .....:In [120]:result=pd.merge( .....:left.reset_index(),right.reset_index(),on=["key"],how="inner" .....:).set_index(["key","X","Y"]) .....:In [121]:resultOut[121]: A B C Dkey X YK0 X0 Y0 A0 B0 C0 D0 X1 Y0 A1 B1 C0 D0K1 X2 Y1 A2 B2 C1 D1

Merging on a combination of columns and index levels#
Strings passed as theon
,left_on
, andright_on
parametersmay refer to either column names or index level names. This enables mergingDataFrame
instances on a combination of index levels and columns withoutresetting indexes.
In [122]:left_index=pd.Index(["K0","K0","K1","K2"],name="key1")In [123]:left=pd.DataFrame( .....:{ .....:"A":["A0","A1","A2","A3"], .....:"B":["B0","B1","B2","B3"], .....:"key2":["K0","K1","K0","K1"], .....:}, .....:index=left_index, .....:) .....:In [124]:right_index=pd.Index(["K0","K1","K2","K2"],name="key1")In [125]:right=pd.DataFrame( .....:{ .....:"C":["C0","C1","C2","C3"], .....:"D":["D0","D1","D2","D3"], .....:"key2":["K0","K0","K0","K1"], .....:}, .....:index=right_index, .....:) .....:In [126]:result=left.merge(right,on=["key1","key2"])In [127]:resultOut[127]: A B key2 C Dkey1K0 A0 B0 K0 C0 D0K1 A2 B2 K0 C1 D1K2 A3 B3 K1 C3 D3

Note
WhenDataFrame
are joined on a string that matches an index level in botharguments, the index level is preserved as an index level in the resultingDataFrame
.
Note
WhenDataFrame
are joined using only some of the levels of aMultiIndex
,the extra levels will be dropped from the resulting join. Topreserve those levels, useDataFrame.reset_index()
on those levelnames to move those levels to columns prior to the join.
Joining multipleDataFrame
#
A list or tuple of:class:`DataFrame`
can also be passed tojoin()
to join them together on their indexes.
In [128]:right2=pd.DataFrame({"v":[7,8,9]},index=["K1","K1","K2"])In [129]:result=left.join([right,right2])

DataFrame.combine_first()
#
DataFrame.combine_first()
update missing values from oneDataFrame
with the non-missing values in anotherDataFrame
in the correspondinglocation.
In [130]:df1=pd.DataFrame( .....:[[np.nan,3.0,5.0],[-4.6,np.nan,np.nan],[np.nan,7.0,np.nan]] .....:) .....:In [131]:df2=pd.DataFrame([[-42.6,np.nan,-8.2],[-5.0,1.6,4]],index=[1,2])In [132]:result=df1.combine_first(df2)In [133]:resultOut[133]: 0 1 20 NaN 3.0 5.01 -4.6 NaN -8.22 -5.0 7.0 4.0

merge_ordered()
#
merge_ordered()
combines order data such as numeric or time series datawith optional filling of missing data withfill_method
.
In [134]:left=pd.DataFrame( .....:{"k":["K0","K1","K1","K2"],"lv":[1,2,3,4],"s":["a","b","c","d"]} .....:) .....:In [135]:right=pd.DataFrame({"k":["K1","K2","K4"],"rv":[1,2,3]})In [136]:pd.merge_ordered(left,right,fill_method="ffill",left_by="s")Out[136]: k lv s rv0 K0 1.0 a NaN1 K1 1.0 a 1.02 K2 1.0 a 2.03 K4 1.0 a 3.04 K1 2.0 b 1.05 K2 2.0 b 2.06 K4 2.0 b 3.07 K1 3.0 c 1.08 K2 3.0 c 2.09 K4 3.0 c 3.010 K1 NaN d 1.011 K2 4.0 d 2.012 K4 4.0 d 3.0
merge_asof()
#
merge_asof()
is similar to an ordered left-join except that mactches are on thenearest key rather than equal keys. For each row in theleft
DataFrame
,the last row in theright
DataFrame
are selected where theon
key is lessthan the left’s key. BothDataFrame
must be sorted by the key.
Optionally anmerge_asof()
can perform a group-wise merge by matching theby
key in addition to the nearest match on theon
key.
In [137]:trades=pd.DataFrame( .....:{ .....:"time":pd.to_datetime( .....:[ .....:"20160525 13:30:00.023", .....:"20160525 13:30:00.038", .....:"20160525 13:30:00.048", .....:"20160525 13:30:00.048", .....:"20160525 13:30:00.048", .....:] .....:), .....:"ticker":["MSFT","MSFT","GOOG","GOOG","AAPL"], .....:"price":[51.95,51.95,720.77,720.92,98.00], .....:"quantity":[75,155,100,100,100], .....:}, .....:columns=["time","ticker","price","quantity"], .....:) .....:In [138]:quotes=pd.DataFrame( .....:{ .....:"time":pd.to_datetime( .....:[ .....:"20160525 13:30:00.023", .....:"20160525 13:30:00.023", .....:"20160525 13:30:00.030", .....:"20160525 13:30:00.041", .....:"20160525 13:30:00.048", .....:"20160525 13:30:00.049", .....:"20160525 13:30:00.072", .....:"20160525 13:30:00.075", .....:] .....:), .....:"ticker":["GOOG","MSFT","MSFT","MSFT","GOOG","AAPL","GOOG","MSFT"], .....:"bid":[720.50,51.95,51.97,51.99,720.50,97.99,720.50,52.01], .....:"ask":[720.93,51.96,51.98,52.00,720.93,98.01,720.88,52.03], .....:}, .....:columns=["time","ticker","bid","ask"], .....:) .....:In [139]:tradesOut[139]: time ticker price quantity0 2016-05-25 13:30:00.023 MSFT 51.95 751 2016-05-25 13:30:00.038 MSFT 51.95 1552 2016-05-25 13:30:00.048 GOOG 720.77 1003 2016-05-25 13:30:00.048 GOOG 720.92 1004 2016-05-25 13:30:00.048 AAPL 98.00 100In [140]:quotesOut[140]: time ticker bid ask0 2016-05-25 13:30:00.023 GOOG 720.50 720.931 2016-05-25 13:30:00.023 MSFT 51.95 51.962 2016-05-25 13:30:00.030 MSFT 51.97 51.983 2016-05-25 13:30:00.041 MSFT 51.99 52.004 2016-05-25 13:30:00.048 GOOG 720.50 720.935 2016-05-25 13:30:00.049 AAPL 97.99 98.016 2016-05-25 13:30:00.072 GOOG 720.50 720.887 2016-05-25 13:30:00.075 MSFT 52.01 52.03In [141]:pd.merge_asof(trades,quotes,on="time",by="ticker")Out[141]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.961 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.982 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.933 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.934 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
merge_asof()
within2ms
between the quote time and the trade time.
In [142]:pd.merge_asof(trades,quotes,on="time",by="ticker",tolerance=pd.Timedelta("2ms"))Out[142]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.961 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.933 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.934 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
merge_asof()
within10ms
between the quote time and the trade time andexclude exact matches on time. Note that though we exclude the exact matches(of the quotes), prior quotesdo propagate to that point in time.
In [143]:pd.merge_asof( .....:trades, .....:quotes, .....:on="time", .....:by="ticker", .....:tolerance=pd.Timedelta("10ms"), .....:allow_exact_matches=False, .....:) .....:Out[143]: time ticker price quantity bid ask0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.982 2016-05-25 13:30:00.048 GOOG 720.77 100 NaN NaN3 2016-05-25 13:30:00.048 GOOG 720.92 100 NaN NaN4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
compare()
#
TheSeries.compare()
andDataFrame.compare()
methods allow you tocompare twoDataFrame
orSeries
, respectively, and summarize their differences.
In [144]:df=pd.DataFrame( .....:{ .....:"col1":["a","a","b","b","a"], .....:"col2":[1.0,2.0,3.0,np.nan,5.0], .....:"col3":[1.0,2.0,3.0,4.0,5.0], .....:}, .....:columns=["col1","col2","col3"], .....:) .....:In [145]:dfOut[145]: col1 col2 col30 a 1.0 1.01 a 2.0 2.02 b 3.0 3.03 b NaN 4.04 a 5.0 5.0In [146]:df2=df.copy()In [147]:df2.loc[0,"col1"]="c"In [148]:df2.loc[2,"col3"]=4.0In [149]:df2Out[149]: col1 col2 col30 c 1.0 1.01 a 2.0 2.02 b 3.0 4.03 b NaN 4.04 a 5.0 5.0In [150]:df.compare(df2)Out[150]: col1 col3 self other self other0 a c NaN NaN2 NaN NaN 3.0 4.0
By default, if two corresponding values are equal, they will be shown asNaN
.Furthermore, if all values in an entire row / column, the row / column will beomitted from the result. The remaining differences will be aligned on columns.
Stack the differences on rows.
In [151]:df.compare(df2,align_axis=0)Out[151]: col1 col30 self a NaN other c NaN2 self NaN 3.0 other NaN 4.0
Keep all original rows and columns withkeep_shape=True
In [152]:df.compare(df2,keep_shape=True)Out[152]: col1 col2 col3 self other self other self other0 a c NaN NaN NaN NaN1 NaN NaN NaN NaN NaN NaN2 NaN NaN NaN NaN 3.0 4.03 NaN NaN NaN NaN NaN NaN4 NaN NaN NaN NaN NaN NaN
Keep all the original values even if they are equal.
In [153]:df.compare(df2,keep_shape=True,keep_equal=True)Out[153]: col1 col2 col3 self other self other self other0 a c 1.0 1.0 1.0 1.01 a a 2.0 2.0 2.0 2.02 b b 3.0 3.0 3.0 4.03 b b NaN NaN 4.0 4.04 a a 5.0 5.0 5.0 5.0