Pandas merge with logic

Question 1

I would like to merge two dataframes, but can't exactly figure out how to do so without iterating. Basically, I want to merge a row from df2 to df1 if df1.date >= df2.start_date and df1.date <= df2.end_date. See example below:

df1:index   date         value0       2012-08-01   821       2012-08-02   202       2012-08-03   94...n-1     2012-10-29   58n       2012-10-30   73df2:index   start_date   end_date     other_value0       2012-08-01   2012-09-04   'foo'1       2012-09-05   2012-10-15   'bar'2       2012-10-16   2012-11-01   'foobar'...final_df:index   df2_index   date         value  other_value0       0           2012-08-01   82     'foo'1       0           2012-08-02   20     'foo'2       0           2012-08-03   94     'foo'...n-1     2           2012-10-29   58     'foobar'n       2           2012-10-30   73     'foobar'

I thought about creating a date series vector to merge with df2, so that I can combine on date, but it seems very manual and does not leverage the power/speed of pandas. I also thought about trying to expand df2 into single days, but couldn't find any way to do so without a manual / iteration type solution.

Question 2

Are the intervals defined bystart_date andend_date disjoint?

Question 3

The naive iterative approach isO(n*m), wheren = len(df1) andm = len(df2), since for each date indf1 you would have to check its inclusion in up tom intervals.

If the intervals defined bydf2 are disjoint, then there is a theoretically better way: usesearchsorted to find where each date indf1 fits amongst the start_dates, and then usesearchsorted a second time to find where each date fits amongst the end_dates. When the index from the two calls tosearchsorted are equal, the date falls inside an interval.

Searchsorted assumes the cutoff dates are sorted and it uses binary search, so each call has complexity O(n*log(m)).

Ifm is large enough, usingsearchsorted should be faster than the naive iterative approach.

Ifm is not large, the iterative approach may be faster.

Here is an example, usingsearchsorted:

import numpy as npimport pandas as pdTimestamp = pd.Timestampdf1 = pd.DataFrame({'date': (Timestamp('2012-08-01'),                             Timestamp('2012-08-02'),                             Timestamp('2012-08-03'),                             Timestamp('2012-10-29'),                             Timestamp('2012-10-30'),                             Timestamp('2012-11-01'),                             Timestamp('2012-10-15'),  # on then end_date                             Timestamp('2012-09-04'),  # outside an interval                             Timestamp('2012-09-05'),  # on then start_date                             ),                    'value': (82, 20, 94, 58, 73, 1, 2, 3, 4)})print(df1)df2 = pd.DataFrame({'end_date': (                        Timestamp('2012-10-15'),                        Timestamp('2012-09-04'),                        Timestamp('2012-11-01')),                    'other_value': ("foo", "bar", "foobar"),                    'start_date': (                        Timestamp('2012-09-05'),                        Timestamp('2012-08-01'),                        Timestamp('2012-10-16'))})df2 = df2.reindex(columns=['start_date', 'end_date', 'other_value'])df2.sort(['start_date'], inplace=True)print(df2)# Convert to DatetimeIndexes so we can call the searchsorted methoddate_idx = pd.DatetimeIndex(df1['date'])start_date_idx = pd.DatetimeIndex(df2['start_date'])# Add one to the end_date so the original end_date will be included in the# half-open interval.end_date_idx = pd.DatetimeIndex(df2['end_date'])+pd.DateOffset(days=1)start_idx = start_date_idx.searchsorted(date_idx, side='right')-1end_idx = end_date_idx.searchsorted(date_idx, side='right')df1['idx'] = np.where(start_idx == end_idx, end_idx, np.nan)result = pd.merge(df1, df2, left_on=['idx'], right_index=True)result = result.reindex(columns=['idx', 'date', 'value', 'other_value'])print(result)

Withdf1 equal to

        date  value0 2012-08-01     821 2012-08-02     202 2012-08-03     943 2012-10-29     584 2012-10-30     735 2012-11-01      16 2012-10-15      27 2012-09-04      38 2012-09-05      4

anddf2 equal to

  start_date   end_date other_value1 2012-08-01 2012-09-04         bar0 2012-09-05 2012-10-15         foo2 2012-10-16 2012-11-01      foobar

the above code yields

   idx       date  value other_value0    0 2012-08-01     82         foo1    0 2012-08-02     20         foo2    0 2012-08-03     94         foo7    0 2012-09-04      3         foo3    2 2012-10-29     58      foobar4    2 2012-10-30     73      foobar5    2 2012-11-01      1      foobar6    1 2012-10-15      2         bar8    1 2012-09-05      4         bar

Question 4

Wow this is fast. I thought my naive implementation was relatively fast, I sorted both dataframes and held the index of df1 and df2 so that I was at O(m+n). I was getting ~1.5 seconds for len(df1)=720 and len(df2)=25, your implementation is clocking in at .002 seconds. Maybe the inclusion step was the bottleneck? Either way, thank you so much. Is there a good place to find optimizations like this one or do you just know this from experience?

Question 5

Much of what I know comes from seeing others do similar things. I don't have a citation for this particular trick, but once you are aware of the existence ofsearchsorted, the application follows pretty naturally.

unutbu 886k197 gold badges1.9k silver badges1.7k bronze badges · Accepted Answer · 2014-08-04 21:13:19Z

The naive iterative approach isO(n*m), wheren = len(df1) andm = len(df2), since for each date indf1 you would have to check its inclusion in up tom intervals.

If the intervals defined bydf2 are disjoint, then there is a theoretically better way: usesearchsorted to find where each date indf1 fits amongst the start_dates, and then usesearchsorted a second time to find where each date fits amongst the end_dates. When the index from the two calls tosearchsorted are equal, the date falls inside an interval.

Searchsorted assumes the cutoff dates are sorted and it uses binary search, so each call has complexity O(n*log(m)).

Ifm is large enough, usingsearchsorted should be faster than the naive iterative approach.

Ifm is not large, the iterative approach may be faster.

Here is an example, usingsearchsorted:

import numpy as npimport pandas as pdTimestamp = pd.Timestampdf1 = pd.DataFrame({'date': (Timestamp('2012-08-01'),                             Timestamp('2012-08-02'),                             Timestamp('2012-08-03'),                             Timestamp('2012-10-29'),                             Timestamp('2012-10-30'),                             Timestamp('2012-11-01'),                             Timestamp('2012-10-15'),  # on then end_date                             Timestamp('2012-09-04'),  # outside an interval                             Timestamp('2012-09-05'),  # on then start_date                             ),                    'value': (82, 20, 94, 58, 73, 1, 2, 3, 4)})print(df1)df2 = pd.DataFrame({'end_date': (                        Timestamp('2012-10-15'),                        Timestamp('2012-09-04'),                        Timestamp('2012-11-01')),                    'other_value': ("foo", "bar", "foobar"),                    'start_date': (                        Timestamp('2012-09-05'),                        Timestamp('2012-08-01'),                        Timestamp('2012-10-16'))})df2 = df2.reindex(columns=['start_date', 'end_date', 'other_value'])df2.sort(['start_date'], inplace=True)print(df2)# Convert to DatetimeIndexes so we can call the searchsorted methoddate_idx = pd.DatetimeIndex(df1['date'])start_date_idx = pd.DatetimeIndex(df2['start_date'])# Add one to the end_date so the original end_date will be included in the# half-open interval.end_date_idx = pd.DatetimeIndex(df2['end_date'])+pd.DateOffset(days=1)start_idx = start_date_idx.searchsorted(date_idx, side='right')-1end_idx = end_date_idx.searchsorted(date_idx, side='right')df1['idx'] = np.where(start_idx == end_idx, end_idx, np.nan)result = pd.merge(df1, df2, left_on=['idx'], right_index=True)result = result.reindex(columns=['idx', 'date', 'value', 'other_value'])print(result)

Withdf1 equal to

        date  value0 2012-08-01     821 2012-08-02     202 2012-08-03     943 2012-10-29     584 2012-10-30     735 2012-11-01      16 2012-10-15      27 2012-09-04      38 2012-09-05      4

anddf2 equal to

  start_date   end_date other_value1 2012-08-01 2012-09-04         bar0 2012-09-05 2012-10-15         foo2 2012-10-16 2012-11-01      foobar

the above code yields

   idx       date  value other_value0    0 2012-08-01     82         foo1    0 2012-08-02     20         foo2    0 2012-08-03     94         foo7    0 2012-09-04      3         foo3    2 2012-10-29     58      foobar4    2 2012-10-30     73      foobar5    2 2012-11-01      1      foobar6    1 2012-10-15      2         bar8    1 2012-09-05      4         bar

Wow this is fast. I thought my naive implementation was relatively fast, I sorted both dataframes and held the index of df1 and df2 so that I was at O(m+n). I was getting ~1.5 seconds for len(df1)=720 and len(df2)=25, your implementation is clocking in at .002 seconds. Maybe the inclusion step was the bottleneck? Either way, thank you so much. Is there a good place to find optimizations like this one or do you just know this from experience?
Much of what I know comes from seeing others do similar things. I don't have a citation for this particular trick, but once you are aware of the existence ofsearchsorted, the application follows pretty naturally.

Movatterモバイル変換

Collectives™ on Stack Overflow

Pandas merge with logic

1 Answer1

2 Comments

Your Answer

Sign up orlog in

Post as a guest

Related

Hot Network Questions

Subscribe to RSS