Merge, join, concatenate and compare #

pandas provides various methods for combining and comparingSeries orDataFrame.

concat(): Merge multipleSeries orDataFrame objects along a shared index or column
DataFrame.join(): Merge multipleDataFrame objects along the columns
DataFrame.combine_first(): Update missing values with non-missing values in the same location
merge(): Combine twoSeries orDataFrame objects with SQL-style joining
merge_ordered(): Combine twoSeries orDataFrame objects along an ordered axis
merge_asof(): Combine twoSeries orDataFrame objects by near instead of exact matching keys
Series.compare() andDataFrame.compare(): Show differences in values between twoSeries orDataFrame objects

`concat()`#

Theconcat() function concatenates an arbitrary amount ofSeries orDataFrame objects along an axis whileperforming optional set logic (union or intersection) of the indexes onthe other axes. Likenumpy.concatenate,concat()takes a list or dict of homogeneously-typed objects and concatenates them.

In [1]:df1=pd.DataFrame(   ...:{   ...:"A":["A0","A1","A2","A3"],   ...:"B":["B0","B1","B2","B3"],   ...:"C":["C0","C1","C2","C3"],   ...:"D":["D0","D1","D2","D3"],   ...:},   ...:index=[0,1,2,3],   ...:)   ...:In [2]:df2=pd.DataFrame(   ...:{   ...:"A":["A4","A5","A6","A7"],   ...:"B":["B4","B5","B6","B7"],   ...:"C":["C4","C5","C6","C7"],   ...:"D":["D4","D5","D6","D7"],   ...:},   ...:index=[4,5,6,7],   ...:)   ...:In [3]:df3=pd.DataFrame(   ...:{   ...:"A":["A8","A9","A10","A11"],   ...:"B":["B8","B9","B10","B11"],   ...:"C":["C8","C9","C10","C11"],   ...:"D":["D8","D9","D10","D11"],   ...:},   ...:index=[8,9,10,11],   ...:)   ...:In [4]:frames=[df1,df2,df3]In [5]:result=pd.concat(frames)In [6]:resultOut[6]:      A    B    C    D0    A0   B0   C0   D01    A1   B1   C1   D12    A2   B2   C2   D23    A3   B3   C3   D34    A4   B4   C4   D45    A5   B5   C5   D56    A6   B6   C6   D67    A7   B7   C7   D78    A8   B8   C8   D89    A9   B9   C9   D910  A10  B10  C10  D1011  A11  B11  C11  D11

Note

concat() makes a full copy of the data, and iterativelyreusingconcat() can create unnecessary copies. Collect allDataFrame orSeries objects in a list before usingconcat().

frames=[process_your_file(f)forfinfiles]result=pd.concat(frames)

Note

When concatenatingDataFrame with named axes, pandas will attempt to preservethese index/column names whenever possible. In the case where all inputs share acommon name, this name will be assigned to the result. When the input names donot all agree, the result will be unnamed. The same is true forMultiIndex,but the logic is applied separately on a level-by-level basis.

Joining logic of the resulting axis#

Thejoin keyword specifies how to handle axis values that don’t exist in the firstDataFrame.

join='outer' takes the union of all axis values

In [7]:df4=pd.DataFrame(   ...:{   ...:"B":["B2","B3","B6","B7"],   ...:"D":["D2","D3","D6","D7"],   ...:"F":["F2","F3","F6","F7"],   ...:},   ...:index=[2,3,6,7],   ...:)   ...:In [8]:result=pd.concat([df1,df4],axis=1)In [9]:resultOut[9]:     A    B    C    D    B    D    F0   A0   B0   C0   D0  NaN  NaN  NaN1   A1   B1   C1   D1  NaN  NaN  NaN2   A2   B2   C2   D2   B2   D2   F23   A3   B3   C3   D3   B3   D3   F36  NaN  NaN  NaN  NaN   B6   D6   F67  NaN  NaN  NaN  NaN   B7   D7   F7

join='inner' takes the intersection of the axis values

In [10]:result=pd.concat([df1,df4],axis=1,join="inner")In [11]:resultOut[11]:    A   B   C   D   B   D   F2  A2  B2  C2  D2  B2  D2  F23  A3  B3  C3  D3  B3  D3  F3

../_images/merging_concat_axis1_inner.png

To perform an effective “left” join using theexact index from the originalDataFrame, result can be reindexed.

In [12]:result=pd.concat([df1,df4],axis=1).reindex(df1.index)In [13]:resultOut[13]:    A   B   C   D    B    D    F0  A0  B0  C0  D0  NaN  NaN  NaN1  A1  B1  C1  D1  NaN  NaN  NaN2  A2  B2  C2  D2   B2   D2   F23  A3  B3  C3  D3   B3   D3   F3

../_images/merging_concat_axis1_join_axes.png

Ignoring indexes on the concatenation axis#

ForDataFrame objects which don’t have a meaningful index, theignore_indexignores overlapping indexes.

In [14]:result=pd.concat([df1,df4],ignore_index=True,sort=False)In [15]:resultOut[15]:     A   B    C   D    F0   A0  B0   C0  D0  NaN1   A1  B1   C1  D1  NaN2   A2  B2   C2  D2  NaN3   A3  B3   C3  D3  NaN4  NaN  B2  NaN  D2   F25  NaN  B3  NaN  D3   F36  NaN  B6  NaN  D6   F67  NaN  B7  NaN  D7   F7

../_images/merging_concat_ignore_index.png

Concatenating`Series` and`DataFrame` together#

You can concatenate a mix ofSeries andDataFrame objects. TheSeries will be transformed toDataFrame with the column name asthe name of theSeries.

In [16]:s1=pd.Series(["X0","X1","X2","X3"],name="X")In [17]:result=pd.concat([df1,s1],axis=1)In [18]:resultOut[18]:    A   B   C   D   X0  A0  B0  C0  D0  X01  A1  B1  C1  D1  X12  A2  B2  C2  D2  X23  A3  B3  C3  D3  X3

../_images/merging_concat_mixed_ndim.png

UnnamedSeries will be numbered consecutively.

In [19]:s2=pd.Series(["_0","_1","_2","_3"])In [20]:result=pd.concat([df1,s2,s2,s2],axis=1)In [21]:resultOut[21]:    A   B   C   D   0   1   20  A0  B0  C0  D0  _0  _0  _01  A1  B1  C1  D1  _1  _1  _12  A2  B2  C2  D2  _2  _2  _23  A3  B3  C3  D3  _3  _3  _3

../_images/merging_concat_unnamed_series.png

ignore_index=True will drop all name references.

In [22]:result=pd.concat([df1,s1],axis=1,ignore_index=True)In [23]:resultOut[23]:    0   1   2   3   40  A0  B0  C0  D0  X01  A1  B1  C1  D1  X12  A2  B2  C2  D2  X23  A3  B3  C3  D3  X3

../_images/merging_concat_series_ignore_index.png

Resulting`keys`#

Thekeys argument adds another axis level to the resulting index or column (creatingaMultiIndex) associate specific keys with each originalDataFrame.

In [24]:result=pd.concat(frames,keys=["x","y","z"])In [25]:resultOut[25]:        A    B    C    Dx 0    A0   B0   C0   D0  1    A1   B1   C1   D1  2    A2   B2   C2   D2  3    A3   B3   C3   D3y 4    A4   B4   C4   D4  5    A5   B5   C5   D5  6    A6   B6   C6   D6  7    A7   B7   C7   D7z 8    A8   B8   C8   D8  9    A9   B9   C9   D9  10  A10  B10  C10  D10  11  A11  B11  C11  D11In [26]:result.loc["y"]Out[26]:    A   B   C   D4  A4  B4  C4  D45  A5  B5  C5  D56  A6  B6  C6  D67  A7  B7  C7  D7

Thekeys argument cane override the column nameswhen creating a newDataFrame based on existingSeries.

In [27]:s3=pd.Series([0,1,2,3],name="foo")In [28]:s4=pd.Series([0,1,2,3])In [29]:s5=pd.Series([0,1,4,5])In [30]:pd.concat([s3,s4,s5],axis=1)Out[30]:   foo  0  10    0  0  01    1  1  12    2  2  43    3  3  5In [31]:pd.concat([s3,s4,s5],axis=1,keys=["red","blue","yellow"])Out[31]:   red  blue  yellow0    0     0       01    1     1       12    2     2       43    3     3       5

You can also pass a dict toconcat() in which case the dict keys will be usedfor thekeys argument unless otherkeys argument is specified:

In [32]:pieces={"x":df1,"y":df2,"z":df3}In [33]:result=pd.concat(pieces)In [34]:resultOut[34]:        A    B    C    Dx 0    A0   B0   C0   D0  1    A1   B1   C1   D1  2    A2   B2   C2   D2  3    A3   B3   C3   D3y 4    A4   B4   C4   D4  5    A5   B5   C5   D5  6    A6   B6   C6   D6  7    A7   B7   C7   D7z 8    A8   B8   C8   D8  9    A9   B9   C9   D9  10  A10  B10  C10  D10  11  A11  B11  C11  D11

In [35]:result=pd.concat(pieces,keys=["z","y"])In [36]:resultOut[36]:        A    B    C    Dz 8    A8   B8   C8   D8  9    A9   B9   C9   D9  10  A10  B10  C10  D10  11  A11  B11  C11  D11y 4    A4   B4   C4   D4  5    A5   B5   C5   D5  6    A6   B6   C6   D6  7    A7   B7   C7   D7

TheMultiIndex created has levels that are constructed from the passed keys andthe index of theDataFrame pieces:

In [37]:result.index.levelsOut[37]:FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])

levels argument allows specifying resulting levels associated with thekeys

In [38]:result=pd.concat(   ....:pieces,keys=["x","y","z"],levels=[["z","y","x","w"]],names=["group_key"]   ....:)   ....:In [39]:resultOut[39]:                A    B    C    Dgroup_keyx         0    A0   B0   C0   D0          1    A1   B1   C1   D1          2    A2   B2   C2   D2          3    A3   B3   C3   D3y         4    A4   B4   C4   D4          5    A5   B5   C5   D5          6    A6   B6   C6   D6          7    A7   B7   C7   D7z         8    A8   B8   C8   D8          9    A9   B9   C9   D9          10  A10  B10  C10  D10          11  A11  B11  C11  D11

../_images/merging_concat_dict_keys_names.png

In [40]:result.index.levelsOut[40]:FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

Appending rows to a`DataFrame`#

If you have aSeries that you want to append as a single row to aDataFrame, you can convert the row into aDataFrame and useconcat()

In [41]:s2=pd.Series(["X0","X1","X2","X3"],index=["A","B","C","D"])In [42]:result=pd.concat([df1,s2.to_frame().T],ignore_index=True)In [43]:resultOut[43]:    A   B   C   D0  A0  B0  C0  D01  A1  B1  C1  D12  A2  B2  C2  D23  A3  B3  C3  D34  X0  X1  X2  X3

../_images/merging_append_series_as_row.png

`merge()`#

merge() performs join operations similar to relational databases like SQL.Users who are familiar with SQL but new to pandas can reference acomparison with SQL.

Merge types#

merge() implements common SQL style joining operations.

one-to-one: joining twoDataFrame objects ontheir indexes which must contain unique values.
many-to-one: joining a unique index to one ormore columns in a differentDataFrame.
many-to-many : joining columns on columns.

Note

When joining columns on columns, potentially a many-to-many join, anyindexes on the passedDataFrame objectswill be discarded.

For amany-to-many join, if a key combination appearsmore than once in both tables, theDataFrame will have theCartesianproduct of the associated data.

In [44]:left=pd.DataFrame(   ....:{   ....:"key":["K0","K1","K2","K3"],   ....:"A":["A0","A1","A2","A3"],   ....:"B":["B0","B1","B2","B3"],   ....:}   ....:)   ....:In [45]:right=pd.DataFrame(   ....:{   ....:"key":["K0","K1","K2","K3"],   ....:"C":["C0","C1","C2","C3"],   ....:"D":["D0","D1","D2","D3"],   ....:}   ....:)   ....:In [46]:result=pd.merge(left,right,on="key")In [47]:resultOut[47]:  key   A   B   C   D0  K0  A0  B0  C0  D01  K1  A1  B1  C1  D12  K2  A2  B2  C2  D23  K3  A3  B3  C3  D3

Thehow argument tomerge() specifies which keys areincluded in the resulting table. If a key combinationdoes not appear ineither the left or right tables, the values in the joined table will beNA. Here is a summary of thehow options and their SQL equivalent names:

Merge method	SQL Join Name	Description
`left`	`LEFTOUTERJOIN`	Use keys from left frame only
`right`	`RIGHTOUTERJOIN`	Use keys from right frame only
`outer`	`FULLOUTERJOIN`	Use union of keys from both frames
`inner`	`INNERJOIN`	Use intersection of keys from both frames
`cross`	`CROSSJOIN`	Create the cartesian product of rows of both frames

In [48]:left=pd.DataFrame(   ....:{   ....:"key1":["K0","K0","K1","K2"],   ....:"key2":["K0","K1","K0","K1"],   ....:"A":["A0","A1","A2","A3"],   ....:"B":["B0","B1","B2","B3"],   ....:}   ....:)   ....:In [49]:right=pd.DataFrame(   ....:{   ....:"key1":["K0","K1","K1","K2"],   ....:"key2":["K0","K0","K0","K0"],   ....:"C":["C0","C1","C2","C3"],   ....:"D":["D0","D1","D2","D3"],   ....:}   ....:)   ....:In [50]:result=pd.merge(left,right,how="left",on=["key1","key2"])In [51]:resultOut[51]:  key1 key2   A   B    C    D0   K0   K0  A0  B0   C0   D01   K0   K1  A1  B1  NaN  NaN2   K1   K0  A2  B2   C1   D13   K1   K0  A2  B2   C2   D24   K2   K1  A3  B3  NaN  NaN

../_images/merging_merge_on_key_left.png

In [52]:result=pd.merge(left,right,how="right",on=["key1","key2"])In [53]:resultOut[53]:  key1 key2    A    B   C   D0   K0   K0   A0   B0  C0  D01   K1   K0   A2   B2  C1  D12   K1   K0   A2   B2  C2  D23   K2   K0  NaN  NaN  C3  D3

../_images/merging_merge_on_key_right.png

In [54]:result=pd.merge(left,right,how="outer",on=["key1","key2"])In [55]:resultOut[55]:  key1 key2    A    B    C    D0   K0   K0   A0   B0   C0   D01   K0   K1   A1   B1  NaN  NaN2   K1   K0   A2   B2   C1   D13   K1   K0   A2   B2   C2   D24   K2   K0  NaN  NaN   C3   D35   K2   K1   A3   B3  NaN  NaN

../_images/merging_merge_on_key_outer.png

In [56]:result=pd.merge(left,right,how="inner",on=["key1","key2"])In [57]:resultOut[57]:  key1 key2   A   B   C   D0   K0   K0  A0  B0  C0  D01   K1   K0  A2  B2  C1  D12   K1   K0  A2  B2  C2  D2

../_images/merging_merge_on_key_inner.png

In [58]:result=pd.merge(left,right,how="cross")In [59]:resultOut[59]:   key1_x key2_x   A   B key1_y key2_y   C   D0      K0     K0  A0  B0     K0     K0  C0  D01      K0     K0  A0  B0     K1     K0  C1  D12      K0     K0  A0  B0     K1     K0  C2  D23      K0     K0  A0  B0     K2     K0  C3  D34      K0     K1  A1  B1     K0     K0  C0  D0..    ...    ...  ..  ..    ...    ...  ..  ..11     K1     K0  A2  B2     K2     K0  C3  D312     K2     K1  A3  B3     K0     K0  C0  D013     K2     K1  A3  B3     K1     K0  C1  D114     K2     K1  A3  B3     K1     K0  C2  D215     K2     K1  A3  B3     K2     K0  C3  D3[16 rows x 8 columns]

You canSeries and aDataFrame with aMultiIndex if the names oftheMultiIndex correspond to the columns from theDataFrame. TransformtheSeries to aDataFrame usingSeries.reset_index() before merging

In [60]:df=pd.DataFrame({"Let":["A","B","C"],"Num":[1,2,3]})In [61]:dfOut[61]:  Let  Num0   A    11   B    22   C    3In [62]:ser=pd.Series(   ....:["a","b","c","d","e","f"],   ....:index=pd.MultiIndex.from_arrays(   ....:[["A","B","C"]*2,[1,2,3,4,5,6]],names=["Let","Num"]   ....:),   ....:)   ....:In [63]:serOut[63]:Let  NumA    1      aB    2      bC    3      cA    4      dB    5      eC    6      fdtype: objectIn [64]:pd.merge(df,ser.reset_index(),on=["Let","Num"])Out[64]:  Let  Num  00   A    1  a1   B    2  b2   C    3  c

Performing an outer join with duplicate join keys inDataFrame

In [65]:left=pd.DataFrame({"A":[1,2],"B":[2,2]})In [66]:right=pd.DataFrame({"A":[4,5,6],"B":[2,2,2]})In [67]:result=pd.merge(left,right,on="B",how="outer")In [68]:resultOut[68]:   A_x  B  A_y0    1  2    41    1  2    52    1  2    63    2  2    44    2  2    55    2  2    6

Warning

Merging on duplicate keys significantly increase the dimensions of the resultand can cause a memory overflow.

Merge key uniqueness#

Thevalidate argument checks whether the uniqueness of merge keys.Key uniqueness is checked before merge operations and can protect against memory overflowsand unexpected key duplication.

In [69]:left=pd.DataFrame({"A":[1,2],"B":[1,2]})In [70]:right=pd.DataFrame({"A":[4,5,6],"B":[2,2,2]})In [71]:result=pd.merge(left,right,on="B",how="outer",validate="one_to_one")---------------------------------------------------------------------------MergeErrorTraceback (most recent call last)CellIn[71],line1---->1result=pd.merge(left,right,on="B",how="outer",validate="one_to_one")File ~/work/pandas/pandas/pandas/core/reshape/merge.py:170, inmerge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)155return_cross_merge(156left_df,157right_df,(...)167copy=copy,168)169else:-->170op=_MergeOperation(171left_df,172right_df,173how=how,174on=on,175left_on=left_on,176right_on=right_on,177left_index=left_index,178right_index=right_index,179sort=sort,180suffixes=suffixes,181indicator=indicator,182validate=validate,183)184returnop.get_result(copy=copy)File ~/work/pandas/pandas/pandas/core/reshape/merge.py:813, in_MergeOperation.__init__(self, left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, indicator, validate)809# If argument passed to validate,810# check if columns specified as unique811# are in fact unique.812ifvalidateisnotNone:-->813self._validate_validate_kwd(validate)File ~/work/pandas/pandas/pandas/core/reshape/merge.py:1657, in_MergeOperation._validate_validate_kwd(self, validate)1653raiseMergeError(1654"Merge keys are not unique in left dataset; not a one-to-one merge"1655)1656ifnotright_unique:->1657raiseMergeError(1658"Merge keys are not unique in right dataset; not a one-to-one merge"1659)1661elifvalidatein["one_to_many","1:m"]:1662ifnotleft_unique:MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

If the user is aware of the duplicates in the rightDataFrame but wants toensure there are no duplicates in the leftDataFrame, one can use thevalidate='one_to_many' argument instead, which will not raise an exception.

In [72]:pd.merge(left,right,on="B",how="outer",validate="one_to_many")Out[72]:   A_x  B  A_y0    1  1  NaN1    2  2  4.02    2  2  5.03    2  2  6.0

Merge result indicator#

merge() accepts the argumentindicator. IfTrue, aCategorical-type column called_merge will be added to the output objectthat takes on values:

Observation Origin
_merge value
Merge key only in'left' frame
left_only
Merge key only in'right' frame
right_only
Merge key in both frames
both

Observation Origin	`_merge` value
Merge key only in`'left'` frame	`left_only`
Merge key only in`'right'` frame	`right_only`
Merge key in both frames	`both`

In [73]:df1=pd.DataFrame({"col1":[0,1],"col_left":["a","b"]})In [74]:df2=pd.DataFrame({"col1":[1,2,2],"col_right":[2,2,2]})In [75]:pd.merge(df1,df2,on="col1",how="outer",indicator=True)Out[75]:   col1 col_left  col_right      _merge0     0        a        NaN   left_only1     1        b        2.0        both2     2      NaN        2.0  right_only3     2      NaN        2.0  right_only

A string argument toindicator will use the value as the name for the indicator column.

In [76]:pd.merge(df1,df2,on="col1",how="outer",indicator="indicator_column")Out[76]:   col1 col_left  col_right indicator_column0     0        a        NaN        left_only1     1        b        2.0             both2     2      NaN        2.0       right_only3     2      NaN        2.0       right_only

Overlapping value columns#

The mergesuffixes argument takes a tuple of list of strings to append tooverlapping column names in the inputDataFrame to disambiguate the resultcolumns:

In [77]:left=pd.DataFrame({"k":["K0","K1","K2"],"v":[1,2,3]})In [78]:right=pd.DataFrame({"k":["K0","K0","K3"],"v":[4,5,6]})In [79]:result=pd.merge(left,right,on="k")In [80]:resultOut[80]:    k  v_x  v_y0  K0    1    41  K0    1    5

In [81]:result=pd.merge(left,right,on="k",suffixes=("_l","_r"))In [82]:resultOut[82]:    k  v_l  v_r0  K0    1    41  K0    1    5

../_images/merging_merge_overlapped_suffix.png

`DataFrame.join()`#

DataFrame.join() combines the columns of multiple,potentially differently-indexedDataFrame into a single resultDataFrame.

In [83]:left=pd.DataFrame(   ....:{"A":["A0","A1","A2"],"B":["B0","B1","B2"]},index=["K0","K1","K2"]   ....:)   ....:In [84]:right=pd.DataFrame(   ....:{"C":["C0","C2","C3"],"D":["D0","D2","D3"]},index=["K0","K2","K3"]   ....:)   ....:In [85]:result=left.join(right)In [86]:resultOut[86]:     A   B    C    DK0  A0  B0   C0   D0K1  A1  B1  NaN  NaNK2  A2  B2   C2   D2

In [87]:result=left.join(right,how="outer")In [88]:resultOut[88]:      A    B    C    DK0   A0   B0   C0   D0K1   A1   B1  NaN  NaNK2   A2   B2   C2   D2K3  NaN  NaN   C3   D3

In [89]:result=left.join(right,how="inner")In [90]:resultOut[90]:     A   B   C   DK0  A0  B0  C0  D0K2  A2  B2  C2  D2

DataFrame.join() takes an optionalon argument which may be a columnor multiple column names that the passedDataFrame is to bealigned.

In [91]:left=pd.DataFrame(   ....:{   ....:"A":["A0","A1","A2","A3"],   ....:"B":["B0","B1","B2","B3"],   ....:"key":["K0","K1","K0","K1"],   ....:}   ....:)   ....:In [92]:right=pd.DataFrame({"C":["C0","C1"],"D":["D0","D1"]},index=["K0","K1"])In [93]:result=left.join(right,on="key")In [94]:resultOut[94]:    A   B key   C   D0  A0  B0  K0  C0  D01  A1  B1  K1  C1  D12  A2  B2  K0  C0  D03  A3  B3  K1  C1  D1

In [95]:result=pd.merge(   ....:left,right,left_on="key",right_index=True,how="left",sort=False   ....:)   ....:In [96]:resultOut[96]:    A   B key   C   D0  A0  B0  K0  C0  D01  A1  B1  K1  C1  D12  A2  B2  K0  C0  D03  A3  B3  K1  C1  D1

../_images/merging_merge_key_columns.png

To join on multiple keys, the passedDataFrame must have aMultiIndex:

In [97]:left=pd.DataFrame(   ....:{   ....:"A":["A0","A1","A2","A3"],   ....:"B":["B0","B1","B2","B3"],   ....:"key1":["K0","K0","K1","K2"],   ....:"key2":["K0","K1","K0","K1"],   ....:}   ....:)   ....:In [98]:index=pd.MultiIndex.from_tuples(   ....:[("K0","K0"),("K1","K0"),("K2","K0"),("K2","K1")]   ....:)   ....:In [99]:right=pd.DataFrame(   ....:{"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]},index=index   ....:)   ....:In [100]:result=left.join(right,on=["key1","key2"])In [101]:resultOut[101]:    A   B key1 key2    C    D0  A0  B0   K0   K0   C0   D01  A1  B1   K0   K1  NaN  NaN2  A2  B2   K1   K0   C1   D13  A3  B3   K2   K1   C3   D3

The default forDataFrame.join is to perform a left joinwhich uses only the keys found in thecallingDataFrame. Other join types can be specified withhow.

In [102]:result=left.join(right,on=["key1","key2"],how="inner")In [103]:resultOut[103]:    A   B key1 key2   C   D0  A0  B0   K0   K0  C0  D02  A2  B2   K1   K0  C1  D13  A3  B3   K2   K1  C3  D3

../_images/merging_join_multikeys_inner.png

Joining a single Index to a MultiIndex#

You can join aDataFrame with aIndex to aDataFrame with aMultiIndex on a level.Thename of theIndex with match the level name of theMultiIndex.

In [104]:left=pd.DataFrame(   .....:{"A":["A0","A1","A2"],"B":["B0","B1","B2"]},   .....:index=pd.Index(["K0","K1","K2"],name="key"),   .....:)   .....:In [105]:index=pd.MultiIndex.from_tuples(   .....:[("K0","Y0"),("K1","Y1"),("K2","Y2"),("K2","Y3")],   .....:names=["key","Y"],   .....:)   .....:In [106]:right=pd.DataFrame(   .....:{"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]},   .....:index=index,   .....:)   .....:In [107]:result=left.join(right,how="inner")In [108]:resultOut[108]:         A   B   C   Dkey YK0  Y0  A0  B0  C0  D0K1  Y1  A1  B1  C1  D1K2  Y2  A2  B2  C2  D2    Y3  A2  B2  C3  D3

../_images/merging_join_multiindex_inner.png

Joining with two`MultiIndex`#

TheMultiIndex of the input argument must be completely usedin the join and is a subset of the indices in the left argument.

In [109]:leftindex=pd.MultiIndex.from_product(   .....:[list("abc"),list("xy"),[1,2]],names=["abc","xy","num"]   .....:)   .....:In [110]:left=pd.DataFrame({"v1":range(12)},index=leftindex)In [111]:leftOut[111]:            v1abc xy numa   x  1     0       2     1    y  1     2       2     3b   x  1     4       2     5    y  1     6       2     7c   x  1     8       2     9    y  1    10       2    11In [112]:rightindex=pd.MultiIndex.from_product(   .....:[list("abc"),list("xy")],names=["abc","xy"]   .....:)   .....:In [113]:right=pd.DataFrame({"v2":[100*iforiinrange(1,7)]},index=rightindex)In [114]:rightOut[114]:         v2abc xya   x   100    y   200b   x   300    y   400c   x   500    y   600In [115]:left.join(right,on=["abc","xy"],how="inner")Out[115]:            v1   v2abc xy numa   x  1     0  100       2     1  100    y  1     2  200       2     3  200b   x  1     4  300       2     5  300    y  1     6  400       2     7  400c   x  1     8  500       2     9  500    y  1    10  600       2    11  600

In [116]:leftindex=pd.MultiIndex.from_tuples(   .....:[("K0","X0"),("K0","X1"),("K1","X2")],names=["key","X"]   .....:)   .....:In [117]:left=pd.DataFrame(   .....:{"A":["A0","A1","A2"],"B":["B0","B1","B2"]},index=leftindex   .....:)   .....:In [118]:rightindex=pd.MultiIndex.from_tuples(   .....:[("K0","Y0"),("K1","Y1"),("K2","Y2"),("K2","Y3")],names=["key","Y"]   .....:)   .....:In [119]:right=pd.DataFrame(   .....:{"C":["C0","C1","C2","C3"],"D":["D0","D1","D2","D3"]},index=rightindex   .....:)   .....:In [120]:result=pd.merge(   .....:left.reset_index(),right.reset_index(),on=["key"],how="inner"   .....:).set_index(["key","X","Y"])   .....:In [121]:resultOut[121]:            A   B   C   Dkey X  YK0  X0 Y0  A0  B0  C0  D0    X1 Y0  A1  B1  C0  D0K1  X2 Y1  A2  B2  C1  D1

../_images/merging_merge_two_multiindex.png

Merging on a combination of columns and index levels#

Strings passed as theon,left_on, andright_on parametersmay refer to either column names or index level names. This enables mergingDataFrame instances on a combination of index levels and columns withoutresetting indexes.

In [122]:left_index=pd.Index(["K0","K0","K1","K2"],name="key1")In [123]:left=pd.DataFrame(   .....:{   .....:"A":["A0","A1","A2","A3"],   .....:"B":["B0","B1","B2","B3"],   .....:"key2":["K0","K1","K0","K1"],   .....:},   .....:index=left_index,   .....:)   .....:In [124]:right_index=pd.Index(["K0","K1","K2","K2"],name="key1")In [125]:right=pd.DataFrame(   .....:{   .....:"C":["C0","C1","C2","C3"],   .....:"D":["D0","D1","D2","D3"],   .....:"key2":["K0","K0","K0","K1"],   .....:},   .....:index=right_index,   .....:)   .....:In [126]:result=left.merge(right,on=["key1","key2"])In [127]:resultOut[127]:       A   B key2   C   Dkey1K0    A0  B0   K0  C0  D0K1    A2  B2   K0  C1  D1K2    A3  B3   K1  C3  D3

../_images/merge_on_index_and_column.png

Note

WhenDataFrame are joined on a string that matches an index level in botharguments, the index level is preserved as an index level in the resultingDataFrame.

Note

WhenDataFrame are joined using only some of the levels of aMultiIndex,the extra levels will be dropped from the resulting join. Topreserve those levels, useDataFrame.reset_index() on those levelnames to move those levels to columns prior to the join.

Joining multiple`DataFrame`#

A list or tuple of:class:`DataFrame` can also be passed tojoin()to join them together on their indexes.

In [128]:right2=pd.DataFrame({"v":[7,8,9]},index=["K1","K1","K2"])In [129]:result=left.join([right,right2])

`DataFrame.combine_first()`#

DataFrame.combine_first() update missing values from oneDataFramewith the non-missing values in anotherDataFrame in the correspondinglocation.

In [130]:df1=pd.DataFrame(   .....:[[np.nan,3.0,5.0],[-4.6,np.nan,np.nan],[np.nan,7.0,np.nan]]   .....:)   .....:In [131]:df2=pd.DataFrame([[-42.6,np.nan,-8.2],[-5.0,1.6,4]],index=[1,2])In [132]:result=df1.combine_first(df2)In [133]:resultOut[133]:     0    1    20  NaN  3.0  5.01 -4.6  NaN -8.22 -5.0  7.0  4.0

`merge_ordered()`#

merge_ordered() combines order data such as numeric or time series datawith optional filling of missing data withfill_method.

In [134]:left=pd.DataFrame(   .....:{"k":["K0","K1","K1","K2"],"lv":[1,2,3,4],"s":["a","b","c","d"]}   .....:)   .....:In [135]:right=pd.DataFrame({"k":["K1","K2","K4"],"rv":[1,2,3]})In [136]:pd.merge_ordered(left,right,fill_method="ffill",left_by="s")Out[136]:     k   lv  s   rv0   K0  1.0  a  NaN1   K1  1.0  a  1.02   K2  1.0  a  2.03   K4  1.0  a  3.04   K1  2.0  b  1.05   K2  2.0  b  2.06   K4  2.0  b  3.07   K1  3.0  c  1.08   K2  3.0  c  2.09   K4  3.0  c  3.010  K1  NaN  d  1.011  K2  4.0  d  2.012  K4  4.0  d  3.0

`merge_asof()`#

merge_asof() is similar to an ordered left-join except that mactches are on thenearest key rather than equal keys. For each row in theleftDataFrame,the last row in therightDataFrame are selected where theon key is lessthan the left’s key. BothDataFrame must be sorted by the key.

Optionally anmerge_asof() can perform a group-wise merge by matching theby key in addition to the nearest match on theon key.

In [137]:trades=pd.DataFrame(   .....:{   .....:"time":pd.to_datetime(   .....:[   .....:"20160525 13:30:00.023",   .....:"20160525 13:30:00.038",   .....:"20160525 13:30:00.048",   .....:"20160525 13:30:00.048",   .....:"20160525 13:30:00.048",   .....:]   .....:),   .....:"ticker":["MSFT","MSFT","GOOG","GOOG","AAPL"],   .....:"price":[51.95,51.95,720.77,720.92,98.00],   .....:"quantity":[75,155,100,100,100],   .....:},   .....:columns=["time","ticker","price","quantity"],   .....:)   .....:In [138]:quotes=pd.DataFrame(   .....:{   .....:"time":pd.to_datetime(   .....:[   .....:"20160525 13:30:00.023",   .....:"20160525 13:30:00.023",   .....:"20160525 13:30:00.030",   .....:"20160525 13:30:00.041",   .....:"20160525 13:30:00.048",   .....:"20160525 13:30:00.049",   .....:"20160525 13:30:00.072",   .....:"20160525 13:30:00.075",   .....:]   .....:),   .....:"ticker":["GOOG","MSFT","MSFT","MSFT","GOOG","AAPL","GOOG","MSFT"],   .....:"bid":[720.50,51.95,51.97,51.99,720.50,97.99,720.50,52.01],   .....:"ask":[720.93,51.96,51.98,52.00,720.93,98.01,720.88,52.03],   .....:},   .....:columns=["time","ticker","bid","ask"],   .....:)   .....:In [139]:tradesOut[139]:                     time ticker   price  quantity0 2016-05-25 13:30:00.023   MSFT   51.95        751 2016-05-25 13:30:00.038   MSFT   51.95       1552 2016-05-25 13:30:00.048   GOOG  720.77       1003 2016-05-25 13:30:00.048   GOOG  720.92       1004 2016-05-25 13:30:00.048   AAPL   98.00       100In [140]:quotesOut[140]:                     time ticker     bid     ask0 2016-05-25 13:30:00.023   GOOG  720.50  720.931 2016-05-25 13:30:00.023   MSFT   51.95   51.962 2016-05-25 13:30:00.030   MSFT   51.97   51.983 2016-05-25 13:30:00.041   MSFT   51.99   52.004 2016-05-25 13:30:00.048   GOOG  720.50  720.935 2016-05-25 13:30:00.049   AAPL   97.99   98.016 2016-05-25 13:30:00.072   GOOG  720.50  720.887 2016-05-25 13:30:00.075   MSFT   52.01   52.03In [141]:pd.merge_asof(trades,quotes,on="time",by="ticker")Out[141]:                     time ticker   price  quantity     bid     ask0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.961 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.982 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.933 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.934 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

merge_asof() within2ms between the quote time and the trade time.

In [142]:pd.merge_asof(trades,quotes,on="time",by="ticker",tolerance=pd.Timedelta("2ms"))Out[142]:                     time ticker   price  quantity     bid     ask0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.961 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.933 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.934 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

merge_asof() within10ms between the quote time and the trade time andexclude exact matches on time. Note that though we exclude the exact matches(of the quotes), prior quotesdo propagate to that point in time.

In [143]:pd.merge_asof(   .....:trades,   .....:quotes,   .....:on="time",   .....:by="ticker",   .....:tolerance=pd.Timedelta("10ms"),   .....:allow_exact_matches=False,   .....:)   .....:Out[143]:                     time ticker   price  quantity    bid    ask0 2016-05-25 13:30:00.023   MSFT   51.95        75    NaN    NaN1 2016-05-25 13:30:00.038   MSFT   51.95       155  51.97  51.982 2016-05-25 13:30:00.048   GOOG  720.77       100    NaN    NaN3 2016-05-25 13:30:00.048   GOOG  720.92       100    NaN    NaN4 2016-05-25 13:30:00.048   AAPL   98.00       100    NaN    NaN

`compare()`#

TheSeries.compare() andDataFrame.compare() methods allow you tocompare twoDataFrame orSeries, respectively, and summarize their differences.

In [144]:df=pd.DataFrame(   .....:{   .....:"col1":["a","a","b","b","a"],   .....:"col2":[1.0,2.0,3.0,np.nan,5.0],   .....:"col3":[1.0,2.0,3.0,4.0,5.0],   .....:},   .....:columns=["col1","col2","col3"],   .....:)   .....:In [145]:dfOut[145]:  col1  col2  col30    a   1.0   1.01    a   2.0   2.02    b   3.0   3.03    b   NaN   4.04    a   5.0   5.0In [146]:df2=df.copy()In [147]:df2.loc[0,"col1"]="c"In [148]:df2.loc[2,"col3"]=4.0In [149]:df2Out[149]:  col1  col2  col30    c   1.0   1.01    a   2.0   2.02    b   3.0   4.03    b   NaN   4.04    a   5.0   5.0In [150]:df.compare(df2)Out[150]:  col1       col3  self other self other0    a     c  NaN   NaN2  NaN   NaN  3.0   4.0

By default, if two corresponding values are equal, they will be shown asNaN.Furthermore, if all values in an entire row / column, the row / column will beomitted from the result. The remaining differences will be aligned on columns.

Stack the differences on rows.

In [151]:df.compare(df2,align_axis=0)Out[151]:        col1  col30 self     a   NaN  other    c   NaN2 self   NaN   3.0  other  NaN   4.0

Keep all original rows and columns withkeep_shape=True

In [152]:df.compare(df2,keep_shape=True)Out[152]:  col1       col2       col3  self other self other self other0    a     c  NaN   NaN  NaN   NaN1  NaN   NaN  NaN   NaN  NaN   NaN2  NaN   NaN  NaN   NaN  3.0   4.03  NaN   NaN  NaN   NaN  NaN   NaN4  NaN   NaN  NaN   NaN  NaN   NaN

Keep all the original values even if they are equal.

In [153]:df.compare(df2,keep_shape=True,keep_equal=True)Out[153]:  col1       col2       col3  self other self other self other0    a     c  1.0   1.0  1.0   1.01    a     a  2.0   2.0  2.0   2.02    b     b  3.0   3.0  3.0   4.03    b     b  NaN   NaN  4.0   4.04    a     a  5.0   5.0  5.0   5.0

On this page

Show Source

Movatterモバイル変換

Joining logic of the resulting axis#

Ignoring indexes on the concatenation axis#

ConcatenatingSeries andDataFrame together#

Resultingkeys#

Appending rows to aDataFrame#

Merge types#

Merge key uniqueness#

Merge result indicator#

Overlapping value columns#

Joining a single Index to a MultiIndex#

Joining with twoMultiIndex#

Merging on a combination of columns and index levels#

Joining multipleDataFrame#

Concatenating`Series` and`DataFrame` together#

Resulting`keys`#

Appending rows to a`DataFrame`#

Joining with two`MultiIndex`#

Joining multiple`DataFrame`#