Search

Enter search terms or a module, class or function name.

Comparison with SQL¶

Since many potential pandas users have some familiarity withSQL, this page is meant to provide some examples of howvarious SQL operations would be performed using pandas.

If you’re new to pandas, you might want to first read through10 Minutes to pandasto familiarize yourself with the library.

As is customary, we import pandas and numpy as follows:

In [1]:importpandasaspdIn [2]:importnumpyasnp

Most of the examples will utilize thetips dataset found within pandas tests. We’ll readthe data into a DataFrame calledtips and assume we have a database table of the same name andstructure.

In [3]:url='https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'In [4]:tips=pd.read_csv(url)In [5]:tips.head()Out[5]:   total_bill   tip     sex smoker  day    time  size0       16.99  1.01  Female     No  Sun  Dinner     21       10.34  1.66    Male     No  Sun  Dinner     32       21.01  3.50    Male     No  Sun  Dinner     33       23.68  3.31    Male     No  Sun  Dinner     24       24.59  3.61  Female     No  Sun  Dinner     4

SELECT¶

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a*to select all columns):

SELECTtotal_bill,tip,smoker,timeFROMtipsLIMIT5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

In [6]:tips[['total_bill','tip','smoker','time']].head(5)Out[6]:   total_bill   tip smoker    time0       16.99  1.01     No  Dinner1       10.34  1.66     No  Dinner2       21.01  3.50     No  Dinner3       23.68  3.31     No  Dinner4       24.59  3.61     No  Dinner

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s*).

WHERE¶

Filtering in SQL is done via a WHERE clause.

SELECT*FROMtipsWHEREtime='Dinner'LIMIT5;

DataFrames can be filtered in multiple ways; the most intuitive of which is usingboolean indexing.

In [7]:tips[tips['time']=='Dinner'].head(5)Out[7]:   total_bill   tip     sex smoker  day    time  size0       16.99  1.01  Female     No  Sun  Dinner     21       10.34  1.66    Male     No  Sun  Dinner     32       21.01  3.50    Male     No  Sun  Dinner     33       23.68  3.31    Male     No  Sun  Dinner     24       24.59  3.61  Female     No  Sun  Dinner     4

The above statement is simply passing aSeries of True/False objects to the DataFrame,returning all rows with True.

In [8]:is_dinner=tips['time']=='Dinner'In [9]:is_dinner.value_counts()Out[9]:True     176False     68Name: time, dtype: int64In [10]:tips[is_dinner].head(5)Out[10]:   total_bill   tip     sex smoker  day    time  size0       16.99  1.01  Female     No  Sun  Dinner     21       10.34  1.66    Male     No  Sun  Dinner     32       21.01  3.50    Male     No  Sun  Dinner     33       23.68  3.31    Male     No  Sun  Dinner     24       24.59  3.61  Female     No  Sun  Dinner     4

Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and &(AND).

-- tips of more than $5.00 at Dinner mealsSELECT*FROMtipsWHEREtime='Dinner'ANDtip>5.00;

# tips of more than $5.00 at Dinner mealsIn [11]:tips[(tips['time']=='Dinner')&(tips['tip']>5.00)]Out[11]:     total_bill    tip     sex smoker  day    time  size23        39.42   7.58    Male     No  Sat  Dinner     444        30.40   5.60    Male     No  Sun  Dinner     447        32.40   6.00    Male     No  Sun  Dinner     452        34.81   5.20  Female     No  Sun  Dinner     459        48.27   6.73    Male     No  Sat  Dinner     4116       29.93   5.07    Male     No  Sun  Dinner     4155       29.85   5.14  Female     No  Sun  Dinner     5170       50.81  10.00    Male    Yes  Sat  Dinner     3172        7.25   5.15    Male    Yes  Sun  Dinner     2181       23.33   5.65    Male    Yes  Sun  Dinner     2183       23.17   6.50    Male    Yes  Sun  Dinner     4211       25.89   5.16    Male    Yes  Sat  Dinner     4212       48.33   9.00    Male     No  Sat  Dinner     4214       28.17   6.50  Female    Yes  Sat  Dinner     3239       29.03   5.92    Male     No  Sat  Dinner     3

-- tips by parties of at least 5 diners OR bill total was more than $45SELECT*FROMtipsWHEREsize>=5ORtotal_bill>45;

# tips by parties of at least 5 diners OR bill total was more than $45In [12]:tips[(tips['size']>=5)|(tips['total_bill']>45)]Out[12]:     total_bill    tip     sex smoker   day    time  size59        48.27   6.73    Male     No   Sat  Dinner     4125       29.80   4.20  Female     No  Thur   Lunch     6141       34.30   6.70    Male     No  Thur   Lunch     6142       41.19   5.00    Male     No  Thur   Lunch     5143       27.05   5.00  Female     No  Thur   Lunch     6155       29.85   5.14  Female     No   Sun  Dinner     5156       48.17   5.00    Male     No   Sun  Dinner     6170       50.81  10.00    Male    Yes   Sat  Dinner     3182       45.35   3.50    Male    Yes   Sun  Dinner     3185       20.69   5.00    Male     No   Sun  Dinner     5187       30.46   2.00    Male    Yes   Sun  Dinner     5212       48.33   9.00    Male     No   Sat  Dinner     4216       28.15   3.00    Male    Yes   Sat  Dinner     5

NULL checking is done using thenotnull() andisnull()methods.

In [13]:frame=pd.DataFrame({'col1':['A','B',np.NaN,'C','D'],   ....:'col2':['F',np.NaN,'G','H','I']})   ....:In [14]:frameOut[14]:  col1 col20    A    F1    B  NaN2  NaN    G3    C    H4    D    I

Assume we have a table of the same structure as our DataFrame above. We can see only the recordswherecol2 IS NULL with the following query:

SELECT*FROMframeWHEREcol2ISNULL;

In [15]:frame[frame['col2'].isnull()]Out[15]:  col1 col21    B  NaN

Getting items wherecol1 IS NOT NULL can be done withnotnull().

SELECT*FROMframeWHEREcol1ISNOTNULL;

In [16]:frame[frame['col1'].notnull()]Out[16]:  col1 col20    A    F1    B  NaN3    C    H4    D    I

GROUP BY¶

In pandas, SQL’s GROUP BY operations are performed using the similarly namedgroupby() method.groupby() typically refers to aprocess where we’d like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.

A common SQL operation would be getting the count of records in each group throughout a dataset.For instance, a query getting us the number of tips left by sex:

SELECTsex,count(*)FROMtipsGROUPBYsex;/*Female     87Male      157*/

The pandas equivalent would be:

In [17]:tips.groupby('sex').size()Out[17]:sexFemale     87Male      157dtype: int64

Notice that in the pandas code we usedsize() and notcount(). This is becausecount() applies the function to each column, returningthe number ofnotnull records within each.

In [18]:tips.groupby('sex').count()Out[18]:        total_bill  tip  smoker  day  time  sizesexFemale          87   87      87   87    87    87Male           157  157     157  157   157   157

Alternatively, we could have applied thecount() methodto an individual column:

In [19]:tips.groupby('sex')['total_bill'].count()Out[19]:sexFemale     87Male      157Name: total_bill, dtype: int64

Multiple functions can also be applied at once. For instance, say we’d like to see how tip amountdiffers by day of the week -agg() allows you to pass a dictionaryto your grouped DataFrame, indicating which functions to apply to specific columns.

SELECTday,AVG(tip),COUNT(*)FROMtipsGROUPBYday;/*Fri   2.734737   19Sat   2.993103   87Sun   3.255132   76Thur  2.771452   62*/

In [20]:tips.groupby('day').agg({'tip':np.mean,'day':np.size})Out[20]:           tip  daydayFri   2.734737   19Sat   2.993103   87Sun   3.255132   76Thur  2.771452   62

Grouping by more than one column is done by passing a list of columns to thegroupby() method.

SELECTsmoker,day,COUNT(*),AVG(tip)FROMtipsGROUPBYsmoker,day;/*smoker dayNo     Fri      4  2.812500       Sat     45  3.102889       Sun     57  3.167895       Thur    45  2.673778Yes    Fri     15  2.714000       Sat     42  2.875476       Sun     19  3.516842       Thur    17  3.030000*/

In [21]:tips.groupby(['smoker','day']).agg({'tip':[np.size,np.mean]})Out[21]:              tip             size      meansmoker dayNo     Fri    4.0  2.812500       Sat   45.0  3.102889       Sun   57.0  3.167895       Thur  45.0  2.673778Yes    Fri   15.0  2.714000       Sat   42.0  2.875476       Sun   19.0  3.516842       Thur  17.0  3.030000

JOIN¶

JOINs can be performed withjoin() ormerge(). By default,join() will join the DataFrames on their indices. Each method hasparameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or thecolumns to join on (column names or indices).

In [22]:df1=pd.DataFrame({'key':['A','B','C','D'],   ....:'value':np.random.randn(4)})   ....:In [23]:df2=pd.DataFrame({'key':['B','D','D','E'],   ....:'value':np.random.randn(4)})   ....:

Assume we have two database tables of the same name and structure as our DataFrames.

Now let’s go over the various types of JOINs.

INNER JOIN¶

SELECT*FROMdf1INNERJOINdf2ONdf1.key=df2.key;

# merge performs an INNER JOIN by defaultIn [24]:pd.merge(df1,df2,on='key')Out[24]:  key   value_x   value_y0   B -0.318214  0.5435811   D  2.169960 -0.4260672   D  2.169960  1.138079

merge() also offers parameters for cases when you’d like to join one DataFrame’scolumn with another DataFrame’s index.

In [25]:indexed_df2=df2.set_index('key')In [26]:pd.merge(df1,indexed_df2,left_on='key',right_index=True)Out[26]:  key   value_x   value_y1   B -0.318214  0.5435813   D  2.169960 -0.4260673   D  2.169960  1.138079

LEFT OUTER JOIN¶

-- show all records from df1SELECT*FROMdf1LEFTOUTERJOINdf2ONdf1.key=df2.key;

# show all records from df1In [27]:pd.merge(df1,df2,on='key',how='left')Out[27]:  key   value_x   value_y0   A  0.116174       NaN1   B -0.318214  0.5435812   C  0.285261       NaN3   D  2.169960 -0.4260674   D  2.169960  1.138079

RIGHT JOIN¶

-- show all records from df2SELECT*FROMdf1RIGHTOUTERJOINdf2ONdf1.key=df2.key;

# show all records from df2In [28]:pd.merge(df1,df2,on='key',how='right')Out[28]:  key   value_x   value_y0   B -0.318214  0.5435811   D  2.169960 -0.4260672   D  2.169960  1.1380793   E       NaN  0.086073

FULL JOIN¶

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not thejoined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).

-- show all records from both tablesSELECT*FROMdf1FULLOUTERJOINdf2ONdf1.key=df2.key;

# show all records from both framesIn [29]:pd.merge(df1,df2,on='key',how='outer')Out[29]:  key   value_x   value_y0   A  0.116174       NaN1   B -0.318214  0.5435812   C  0.285261       NaN3   D  2.169960 -0.4260674   D  2.169960  1.1380795   E       NaN  0.086073

UNION¶

UNION ALL can be performed usingconcat().

In [30]:df1=pd.DataFrame({'city':['Chicago','San Francisco','New York City'],   ....:'rank':range(1,4)})   ....:In [31]:df2=pd.DataFrame({'city':['Chicago','Boston','Los Angeles'],   ....:'rank':[1,4,5]})   ....:

SELECTcity,rankFROMdf1UNIONALLSELECTcity,rankFROMdf2;/*         city  rank      Chicago     1San Francisco     2New York City     3      Chicago     1       Boston     4  Los Angeles     5*/

In [32]:pd.concat([df1,df2])Out[32]:            city  rank0        Chicago     11  San Francisco     22  New York City     30        Chicago     11         Boston     42    Los Angeles     5

SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.

SELECTcity,rankFROMdf1UNIONSELECTcity,rankFROMdf2;-- notice that there is only one Chicago record this time/*         city  rank      Chicago     1San Francisco     2New York City     3       Boston     4  Los Angeles     5*/

In pandas, you can useconcat() in conjunction withdrop_duplicates().

In [33]:pd.concat([df1,df2]).drop_duplicates()Out[33]:            city  rank0        Chicago     11  San Francisco     22  New York City     31         Boston     42    Los Angeles     5

Pandas equivalents for some SQL analytic and aggregate functions¶

Top N rows with offset¶

-- MySQLSELECT*FROMtipsORDERBYtipDESCLIMIT10OFFSET5;

In [34]:tips.nlargest(10+5,columns='tip').tail(10)Out[34]:     total_bill   tip     sex smoker   day    time  size183       23.17  6.50    Male    Yes   Sun  Dinner     4214       28.17  6.50  Female    Yes   Sat  Dinner     347        32.40  6.00    Male     No   Sun  Dinner     4239       29.03  5.92    Male     No   Sat  Dinner     388        24.71  5.85    Male     No  Thur   Lunch     2181       23.33  5.65    Male    Yes   Sun  Dinner     244        30.40  5.60    Male     No   Sun  Dinner     452        34.81  5.20  Female     No   Sun  Dinner     485        34.83  5.17  Female     No  Thur   Lunch     4211       25.89  5.16    Male    Yes   Sat  Dinner     4

Top N rows per group¶

-- Oracle's ROW_NUMBER() analytic functionSELECT*FROM(SELECTt.*,ROW_NUMBER()OVER(PARTITIONBYdayORDERBYtotal_billDESC)ASrnFROMtipst)WHERErn<3ORDERBYday,rn;

In [35]:(tips.assign(rn=tips.sort_values(['total_bill'],ascending=False)   ....:.groupby(['day'])   ....:.cumcount()+1)   ....:.query('rn < 3')   ....:.sort_values(['day','rn'])   ....:)   ....:Out[35]:     total_bill    tip     sex smoker   day    time  size  rn95        40.17   4.73    Male    Yes   Fri  Dinner     4   190        28.97   3.00    Male    Yes   Fri  Dinner     2   2170       50.81  10.00    Male    Yes   Sat  Dinner     3   1212       48.33   9.00    Male     No   Sat  Dinner     4   2156       48.17   5.00    Male     No   Sun  Dinner     6   1182       45.35   3.50    Male    Yes   Sun  Dinner     3   2197       43.11   5.00  Female    Yes  Thur   Lunch     4   1142       41.19   5.00    Male     No  Thur   Lunch     5   2

the same usingrank(method=’first’) function

In [36]:(tips.assign(rnk=tips.groupby(['day'])['total_bill']   ....:.rank(method='first',ascending=False))   ....:.query('rnk < 3')   ....:.sort_values(['day','rnk'])   ....:)   ....:Out[36]:     total_bill    tip     sex smoker   day    time  size  rnk95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.090        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0212       48.33   9.00    Male     No   Sat  Dinner     4  2.0156       48.17   5.00    Male     No   Sun  Dinner     6  1.0182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0142       41.19   5.00    Male     No  Thur   Lunch     5  2.0

-- Oracle's RANK() analytic functionSELECT*FROM(SELECTt.*,RANK()OVER(PARTITIONBYsexORDERBYtip)ASrnkFROMtipstWHEREtip<2)WHERErnk<3ORDERBYsex,rnk;

Let’s find tips with (rank < 3) per gender group for (tips < 2).Notice that when usingrank(method='min') functionrnk_min remains the same for the sametip(as Oracle’s RANK() function)

In [37]:(tips[tips['tip']<2]   ....:.assign(rnk_min=tips.groupby(['sex'])['tip']   ....:.rank(method='min'))   ....:.query('rnk_min < 3')   ....:.sort_values(['sex','rnk_min'])   ....:)   ....:Out[37]:     total_bill   tip     sex smoker  day    time  size  rnk_min67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.092         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0111        7.25  1.00  Female     No  Sat  Dinner     1      1.0236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0

UPDATE¶

UPDATEtipsSETtip=tip*2WHEREtip<2;

In [38]:tips.loc[tips['tip']<2,'tip']*=2

DELETE¶

DELETEFROMtipsWHEREtip>9;

In pandas we select the rows that should remain, instead of deleting them

In [39]:tips=tips.loc[tips['tip']<=9]

Movatterモバイル変換

Navigation

Table Of Contents