I would like to merge two dataframes, but can't exactly figure out how to do so without iterating. Basically, I want to merge a row from df2 to df1 if df1.date >= df2.start_date and df1.date <= df2.end_date. See example below:
df1:index date value0 2012-08-01 821 2012-08-02 202 2012-08-03 94...n-1 2012-10-29 58n 2012-10-30 73df2:index start_date end_date other_value0 2012-08-01 2012-09-04 'foo'1 2012-09-05 2012-10-15 'bar'2 2012-10-16 2012-11-01 'foobar'...final_df:index df2_index date value other_value0 0 2012-08-01 82 'foo'1 0 2012-08-02 20 'foo'2 0 2012-08-03 94 'foo'...n-1 2 2012-10-29 58 'foobar'n 2 2012-10-30 73 'foobar'I thought about creating a date series vector to merge with df2, so that I can combine on date, but it seems very manual and does not leverage the power/speed of pandas. I also thought about trying to expand df2 into single days, but couldn't find any way to do so without a manual / iteration type solution.
- Are the intervals defined by
start_dateandend_datedisjoint?unutbu– unutbu2014-08-04 20:54:40 +00:00CommentedAug 4, 2014 at 20:54
1 Answer1
The naive iterative approach isO(n*m), wheren = len(df1) andm = len(df2), since for each date indf1 you would have to check its inclusion in up tom intervals.
If the intervals defined bydf2 are disjoint, then there is a theoretically better way: usesearchsorted to find where each date indf1 fits amongst the start_dates, and then usesearchsorted a second time to find where each date fits amongst the end_dates. When the index from the two calls tosearchsorted are equal, the date falls inside an interval.
Searchsorted assumes the cutoff dates are sorted and it uses binary search, so each call has complexity O(n*log(m)).
Ifm is large enough, usingsearchsorted should be faster than the naive iterative approach.
Ifm is not large, the iterative approach may be faster.
Here is an example, usingsearchsorted:
import numpy as npimport pandas as pdTimestamp = pd.Timestampdf1 = pd.DataFrame({'date': (Timestamp('2012-08-01'), Timestamp('2012-08-02'), Timestamp('2012-08-03'), Timestamp('2012-10-29'), Timestamp('2012-10-30'), Timestamp('2012-11-01'), Timestamp('2012-10-15'), # on then end_date Timestamp('2012-09-04'), # outside an interval Timestamp('2012-09-05'), # on then start_date ), 'value': (82, 20, 94, 58, 73, 1, 2, 3, 4)})print(df1)df2 = pd.DataFrame({'end_date': ( Timestamp('2012-10-15'), Timestamp('2012-09-04'), Timestamp('2012-11-01')), 'other_value': ("foo", "bar", "foobar"), 'start_date': ( Timestamp('2012-09-05'), Timestamp('2012-08-01'), Timestamp('2012-10-16'))})df2 = df2.reindex(columns=['start_date', 'end_date', 'other_value'])df2.sort(['start_date'], inplace=True)print(df2)# Convert to DatetimeIndexes so we can call the searchsorted methoddate_idx = pd.DatetimeIndex(df1['date'])start_date_idx = pd.DatetimeIndex(df2['start_date'])# Add one to the end_date so the original end_date will be included in the# half-open interval.end_date_idx = pd.DatetimeIndex(df2['end_date'])+pd.DateOffset(days=1)start_idx = start_date_idx.searchsorted(date_idx, side='right')-1end_idx = end_date_idx.searchsorted(date_idx, side='right')df1['idx'] = np.where(start_idx == end_idx, end_idx, np.nan)result = pd.merge(df1, df2, left_on=['idx'], right_index=True)result = result.reindex(columns=['idx', 'date', 'value', 'other_value'])print(result)Withdf1 equal to
date value0 2012-08-01 821 2012-08-02 202 2012-08-03 943 2012-10-29 584 2012-10-30 735 2012-11-01 16 2012-10-15 27 2012-09-04 38 2012-09-05 4anddf2 equal to
start_date end_date other_value1 2012-08-01 2012-09-04 bar0 2012-09-05 2012-10-15 foo2 2012-10-16 2012-11-01 foobarthe above code yields
idx date value other_value0 0 2012-08-01 82 foo1 0 2012-08-02 20 foo2 0 2012-08-03 94 foo7 0 2012-09-04 3 foo3 2 2012-10-29 58 foobar4 2 2012-10-30 73 foobar5 2 2012-11-01 1 foobar6 1 2012-10-15 2 bar8 1 2012-09-05 4 bar2 Comments
searchsorted, the application follows pretty naturally.Explore related questions
See similar questions with these tags.