Enter search terms or a module, class or function name.
Since many potential pandas users have some familiarity withSQL, this page is meant to provide some examples of howvarious SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through10 Minutes to pandasto familiarize yourself with the library.
As is customary, we import pandas and numpy as follows:
In [1]:importpandasaspdIn [2]:importnumpyasnp
Most of the examples will utilize thetips dataset found within pandas tests. We’ll readthe data into a DataFrame calledtips and assume we have a database table of the same name andstructure.
In [3]:url='https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'In [4]:tips=pd.read_csv(url)In [5]:tips.head()Out[5]: total_bill tip sex smoker day time size0 16.99 1.01 Female No Sun Dinner 21 10.34 1.66 Male No Sun Dinner 32 21.01 3.50 Male No Sun Dinner 33 23.68 3.31 Male No Sun Dinner 24 24.59 3.61 Female No Sun Dinner 4
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a*to select all columns):
SELECTtotal_bill,tip,smoker,timeFROMtipsLIMIT5;
With pandas, column selection is done by passing a list of column names to your DataFrame:
In [6]:tips[['total_bill','tip','smoker','time']].head(5)Out[6]: total_bill tip smoker time0 16.99 1.01 No Dinner1 10.34 1.66 No Dinner2 21.01 3.50 No Dinner3 23.68 3.31 No Dinner4 24.59 3.61 No Dinner
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s*).
Filtering in SQL is done via a WHERE clause.
SELECT*FROMtipsWHEREtime='Dinner'LIMIT5;
DataFrames can be filtered in multiple ways; the most intuitive of which is usingboolean indexing.
In [7]:tips[tips['time']=='Dinner'].head(5)Out[7]: total_bill tip sex smoker day time size0 16.99 1.01 Female No Sun Dinner 21 10.34 1.66 Male No Sun Dinner 32 21.01 3.50 Male No Sun Dinner 33 23.68 3.31 Male No Sun Dinner 24 24.59 3.61 Female No Sun Dinner 4
The above statement is simply passing aSeries of True/False objects to the DataFrame,returning all rows with True.
In [8]:is_dinner=tips['time']=='Dinner'In [9]:is_dinner.value_counts()Out[9]:True 176False 68Name: time, dtype: int64In [10]:tips[is_dinner].head(5)Out[10]: total_bill tip sex smoker day time size0 16.99 1.01 Female No Sun Dinner 21 10.34 1.66 Male No Sun Dinner 32 21.01 3.50 Male No Sun Dinner 33 23.68 3.31 Male No Sun Dinner 24 24.59 3.61 Female No Sun Dinner 4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and &(AND).
-- tips of more than $5.00 at Dinner mealsSELECT*FROMtipsWHEREtime='Dinner'ANDtip>5.00;
# tips of more than $5.00 at Dinner mealsIn [11]:tips[(tips['time']=='Dinner')&(tips['tip']>5.00)]Out[11]: total_bill tip sex smoker day time size23 39.42 7.58 Male No Sat Dinner 444 30.40 5.60 Male No Sun Dinner 447 32.40 6.00 Male No Sun Dinner 452 34.81 5.20 Female No Sun Dinner 459 48.27 6.73 Male No Sat Dinner 4116 29.93 5.07 Male No Sun Dinner 4155 29.85 5.14 Female No Sun Dinner 5170 50.81 10.00 Male Yes Sat Dinner 3172 7.25 5.15 Male Yes Sun Dinner 2181 23.33 5.65 Male Yes Sun Dinner 2183 23.17 6.50 Male Yes Sun Dinner 4211 25.89 5.16 Male Yes Sat Dinner 4212 48.33 9.00 Male No Sat Dinner 4214 28.17 6.50 Female Yes Sat Dinner 3239 29.03 5.92 Male No Sat Dinner 3
-- tips by parties of at least 5 diners OR bill total was more than $45SELECT*FROMtipsWHEREsize>=5ORtotal_bill>45;
# tips by parties of at least 5 diners OR bill total was more than $45In [12]:tips[(tips['size']>=5)|(tips['total_bill']>45)]Out[12]: total_bill tip sex smoker day time size59 48.27 6.73 Male No Sat Dinner 4125 29.80 4.20 Female No Thur Lunch 6141 34.30 6.70 Male No Thur Lunch 6142 41.19 5.00 Male No Thur Lunch 5143 27.05 5.00 Female No Thur Lunch 6155 29.85 5.14 Female No Sun Dinner 5156 48.17 5.00 Male No Sun Dinner 6170 50.81 10.00 Male Yes Sat Dinner 3182 45.35 3.50 Male Yes Sun Dinner 3185 20.69 5.00 Male No Sun Dinner 5187 30.46 2.00 Male Yes Sun Dinner 5212 48.33 9.00 Male No Sat Dinner 4216 28.15 3.00 Male Yes Sat Dinner 5
NULL checking is done using thenotnull() andisnull()methods.
In [13]:frame=pd.DataFrame({'col1':['A','B',np.NaN,'C','D'], ....:'col2':['F',np.NaN,'G','H','I']}) ....:In [14]:frameOut[14]: col1 col20 A F1 B NaN2 NaN G3 C H4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the recordswherecol2 IS NULL with the following query:
SELECT*FROMframeWHEREcol2ISNULL;
In [15]:frame[frame['col2'].isnull()]Out[15]: col1 col21 B NaN
Getting items wherecol1 IS NOT NULL can be done withnotnull().
SELECT*FROMframeWHEREcol1ISNOTNULL;
In [16]:frame[frame['col1'].notnull()]Out[16]: col1 col20 A F1 B NaN3 C H4 D I
In pandas, SQL’s GROUP BY operations are performed using the similarly namedgroupby() method.groupby() typically refers to aprocess where we’d like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset.For instance, a query getting us the number of tips left by sex:
SELECTsex,count(*)FROMtipsGROUPBYsex;/*Female 87Male 157*/
The pandas equivalent would be:
In [17]:tips.groupby('sex').size()Out[17]:sexFemale 87Male 157dtype: int64
Notice that in the pandas code we usedsize() and notcount(). This is becausecount() applies the function to each column, returningthe number ofnotnull records within each.
In [18]:tips.groupby('sex').count()Out[18]: total_bill tip smoker day time sizesexFemale 87 87 87 87 87 87Male 157 157 157 157 157 157
Alternatively, we could have applied thecount() methodto an individual column:
In [19]:tips.groupby('sex')['total_bill'].count()Out[19]:sexFemale 87Male 157Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amountdiffers by day of the week -agg() allows you to pass a dictionaryto your grouped DataFrame, indicating which functions to apply to specific columns.
SELECTday,AVG(tip),COUNT(*)FROMtipsGROUPBYday;/*Fri 2.734737 19Sat 2.993103 87Sun 3.255132 76Thur 2.771452 62*/
In [20]:tips.groupby('day').agg({'tip':np.mean,'day':np.size})Out[20]: tip daydayFri 2.734737 19Sat 2.993103 87Sun 3.255132 76Thur 2.771452 62
Grouping by more than one column is done by passing a list of columns to thegroupby() method.
SELECTsmoker,day,COUNT(*),AVG(tip)FROMtipsGROUPBYsmoker,day;/*smoker dayNo Fri 4 2.812500 Sat 45 3.102889 Sun 57 3.167895 Thur 45 2.673778Yes Fri 15 2.714000 Sat 42 2.875476 Sun 19 3.516842 Thur 17 3.030000*/
In [21]:tips.groupby(['smoker','day']).agg({'tip':[np.size,np.mean]})Out[21]: tip size meansmoker dayNo Fri 4.0 2.812500 Sat 45.0 3.102889 Sun 57.0 3.167895 Thur 45.0 2.673778Yes Fri 15.0 2.714000 Sat 42.0 2.875476 Sun 19.0 3.516842 Thur 17.0 3.030000
JOINs can be performed withjoin() ormerge(). By default,join() will join the DataFrames on their indices. Each method hasparameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or thecolumns to join on (column names or indices).
In [22]:df1=pd.DataFrame({'key':['A','B','C','D'], ....:'value':np.random.randn(4)}) ....:In [23]:df2=pd.DataFrame({'key':['B','D','D','E'], ....:'value':np.random.randn(4)}) ....:
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
SELECT*FROMdf1INNERJOINdf2ONdf1.key=df2.key;
# merge performs an INNER JOIN by defaultIn [24]:pd.merge(df1,df2,on='key')Out[24]: key value_x value_y0 B -0.318214 0.5435811 D 2.169960 -0.4260672 D 2.169960 1.138079
merge() also offers parameters for cases when you’d like to join one DataFrame’scolumn with another DataFrame’s index.
In [25]:indexed_df2=df2.set_index('key')In [26]:pd.merge(df1,indexed_df2,left_on='key',right_index=True)Out[26]: key value_x value_y1 B -0.318214 0.5435813 D 2.169960 -0.4260673 D 2.169960 1.138079
-- show all records from df1SELECT*FROMdf1LEFTOUTERJOINdf2ONdf1.key=df2.key;
# show all records from df1In [27]:pd.merge(df1,df2,on='key',how='left')Out[27]: key value_x value_y0 A 0.116174 NaN1 B -0.318214 0.5435812 C 0.285261 NaN3 D 2.169960 -0.4260674 D 2.169960 1.138079
-- show all records from df2SELECT*FROMdf1RIGHTOUTERJOINdf2ONdf1.key=df2.key;
# show all records from df2In [28]:pd.merge(df1,df2,on='key',how='right')Out[28]: key value_x value_y0 B -0.318214 0.5435811 D 2.169960 -0.4260672 D 2.169960 1.1380793 E NaN 0.086073
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not thejoined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).
-- show all records from both tablesSELECT*FROMdf1FULLOUTERJOINdf2ONdf1.key=df2.key;
# show all records from both framesIn [29]:pd.merge(df1,df2,on='key',how='outer')Out[29]: key value_x value_y0 A 0.116174 NaN1 B -0.318214 0.5435812 C 0.285261 NaN3 D 2.169960 -0.4260674 D 2.169960 1.1380795 E NaN 0.086073
UNION ALL can be performed usingconcat().
In [30]:df1=pd.DataFrame({'city':['Chicago','San Francisco','New York City'], ....:'rank':range(1,4)}) ....:In [31]:df2=pd.DataFrame({'city':['Chicago','Boston','Los Angeles'], ....:'rank':[1,4,5]}) ....:
SELECTcity,rankFROMdf1UNIONALLSELECTcity,rankFROMdf2;/* city rank Chicago 1San Francisco 2New York City 3 Chicago 1 Boston 4 Los Angeles 5*/
In [32]:pd.concat([df1,df2])Out[32]: city rank0 Chicago 11 San Francisco 22 New York City 30 Chicago 11 Boston 42 Los Angeles 5
SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.
SELECTcity,rankFROMdf1UNIONSELECTcity,rankFROMdf2;-- notice that there is only one Chicago record this time/* city rank Chicago 1San Francisco 2New York City 3 Boston 4 Los Angeles 5*/
In pandas, you can useconcat() in conjunction withdrop_duplicates().
In [33]:pd.concat([df1,df2]).drop_duplicates()Out[33]: city rank0 Chicago 11 San Francisco 22 New York City 31 Boston 42 Los Angeles 5
-- MySQLSELECT*FROMtipsORDERBYtipDESCLIMIT10OFFSET5;
In [34]:tips.nlargest(10+5,columns='tip').tail(10)Out[34]: total_bill tip sex smoker day time size183 23.17 6.50 Male Yes Sun Dinner 4214 28.17 6.50 Female Yes Sat Dinner 347 32.40 6.00 Male No Sun Dinner 4239 29.03 5.92 Male No Sat Dinner 388 24.71 5.85 Male No Thur Lunch 2181 23.33 5.65 Male Yes Sun Dinner 244 30.40 5.60 Male No Sun Dinner 452 34.81 5.20 Female No Sun Dinner 485 34.83 5.17 Female No Thur Lunch 4211 25.89 5.16 Male Yes Sat Dinner 4
-- Oracle's ROW_NUMBER() analytic functionSELECT*FROM(SELECTt.*,ROW_NUMBER()OVER(PARTITIONBYdayORDERBYtotal_billDESC)ASrnFROMtipst)WHERErn<3ORDERBYday,rn;
In [35]:(tips.assign(rn=tips.sort_values(['total_bill'],ascending=False) ....:.groupby(['day']) ....:.cumcount()+1) ....:.query('rn < 3') ....:.sort_values(['day','rn']) ....:) ....:Out[35]: total_bill tip sex smoker day time size rn95 40.17 4.73 Male Yes Fri Dinner 4 190 28.97 3.00 Male Yes Fri Dinner 2 2170 50.81 10.00 Male Yes Sat Dinner 3 1212 48.33 9.00 Male No Sat Dinner 4 2156 48.17 5.00 Male No Sun Dinner 6 1182 45.35 3.50 Male Yes Sun Dinner 3 2197 43.11 5.00 Female Yes Thur Lunch 4 1142 41.19 5.00 Male No Thur Lunch 5 2
the same usingrank(method=’first’) function
In [36]:(tips.assign(rnk=tips.groupby(['day'])['total_bill'] ....:.rank(method='first',ascending=False)) ....:.query('rnk < 3') ....:.sort_values(['day','rnk']) ....:) ....:Out[36]: total_bill tip sex smoker day time size rnk95 40.17 4.73 Male Yes Fri Dinner 4 1.090 28.97 3.00 Male Yes Fri Dinner 2 2.0170 50.81 10.00 Male Yes Sat Dinner 3 1.0212 48.33 9.00 Male No Sat Dinner 4 2.0156 48.17 5.00 Male No Sun Dinner 6 1.0182 45.35 3.50 Male Yes Sun Dinner 3 2.0197 43.11 5.00 Female Yes Thur Lunch 4 1.0142 41.19 5.00 Male No Thur Lunch 5 2.0
-- Oracle's RANK() analytic functionSELECT*FROM(SELECTt.*,RANK()OVER(PARTITIONBYsexORDERBYtip)ASrnkFROMtipstWHEREtip<2)WHERErnk<3ORDERBYsex,rnk;
Let’s find tips with (rank < 3) per gender group for (tips < 2).Notice that when usingrank(method='min') functionrnk_min remains the same for the sametip(as Oracle’s RANK() function)
In [37]:(tips[tips['tip']<2] ....:.assign(rnk_min=tips.groupby(['sex'])['tip'] ....:.rank(method='min')) ....:.query('rnk_min < 3') ....:.sort_values(['sex','rnk_min']) ....:) ....:Out[37]: total_bill tip sex smoker day time size rnk_min67 3.07 1.00 Female Yes Sat Dinner 1 1.092 5.75 1.00 Female Yes Fri Dinner 2 1.0111 7.25 1.00 Female No Sat Dinner 1 1.0236 12.60 1.00 Male Yes Sat Dinner 2 1.0237 32.83 1.17 Male Yes Sat Dinner 2 2.0
DELETEFROMtipsWHEREtip>9;
In pandas we select the rows that should remain, instead of deleting them
In [39]:tips=tips.loc[tips['tip']<=9]