
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

asked Nov 15, 2013 at 0:47 by Michael

6 Answers


I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
answered Nov 15, 2013 at 1:05 by DSM

5 Comments

Note that the correct keyword is formula; I accidentally typed formulas instead and got a weird error: TypeError: from_formula() takes at least 3 arguments (2 given)
@DSM Very new to Python. Tried running your same code and got errors on both print statements: print result.summary() gives SyntaxError: invalid syntax, and print result.params gives SyntaxError: Missing parentheses in call to 'print'... Maybe I loaded packages wrong?? It appears to work when I don't put "print". Thanks.
@a.powell The OP's code is for Python 2. The only change I think you need to make is to put parentheses round the arguments to print: print(result.params) and print(result.summary())
Attempting to use this formula() approach throws the type error TypeError: __init__() missing 1 required positional argument: 'endog', so I guess it's deprecated. Also, ols is now OLS.
As others mention, sm.ols has been deprecated in favor of sm.OLS. The default behavior is also different. To run a regression from a formula as done here, you need to do: result = sm.OLS.from_formula(formula="A ~ B + C", data=df).fit()
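
For reference, here is a minimal sketch of the same regression with the current statsmodels formula interface (assuming a recent statsmodels; the smf alias and variable names just mirror the answer above), which is equivalent to the sm.OLS.from_formula call mentioned in the comment:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# smf.ols accepts an R-style formula plus the dataframe, as in the original answer
result = smf.ols(formula="A ~ B + C", data=df).fit()
print(result.params)     # Intercept, B and C coefficients
print(result.summary())  # the full regression table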

Note: pandas.stats has been removed as of pandas 0.20.0


It's possible to do this with pandas.stats.ols:

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations:         5
Number of Degrees of Freedom:   3
R-squared:         0.5789
Adj R-squared:     0.1577
Rmse:             14.5108
F-stat (2, 2):     1.3746, p-value:     0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
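
Since pandas.stats.ols is now gone, here is a sketch of roughly the same fit done with statsmodels directly (sm.add_constant stands in for the intercept that pandas.stats.ols added automatically; variable names are just illustrative):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# add_constant appends the intercept column that pandas.stats.ols included for you
res = sm.OLS(df['A'], sm.add_constant(df[['B', 'C']])).fit()
print(res.summary())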

answered Nov 15, 2013 at 8:00 by roman

4 Comments

Note that this is going to be deprecated in a future version of pandas!
The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://www.statsmodels.org/stable/regression.html
@DestaHaileselassieHagos: This may be due to an issue with missing intercepts. The designer of the equivalent R package adjusts by removing the adjustment for the mean: stats.stackexchange.com/a/36068/64552. Other suggestions: you can use sm.add_constant to add an intercept to the exog array, and use a dict: reg = ols("y ~ x", data=dict(y=y, x=x)).fit()
There is a strange data item for C: 42523. It is an outlier. It should probably be removed or imputed with the average of the remaining values.

I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a numpy array or any other data type.

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
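
A small usage sketch continuing from reg above: the fitted model predicts straight from dataframe columns as well.

print(reg.intercept_)               # ~14.95, matching the statsmodels result above
print(reg.predict(df[['B', 'C']]))  # fitted values for the five training rows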
answered Jan 7, 2017 at 2:51 by 3novak

1 Comment

Small diversion from the OP - but I found this particular answer very helpful, after appending .values.reshape(-1, 1) to the dataframe columns. For example: x_data = df['x_data'].values.reshape(-1, 1) and passing the x_data (and a similarly created y_data) np arrays into the .fit() method.
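
To illustrate what that comment describes, a sketch of the single-column case, reusing df and reg from this answer (the x_data and y_data names are just illustrative): scikit-learn expects a 2-D X, so a lone column must be reshaped, or selected as a one-column frame.

x_data = df['B'].values.reshape(-1, 1)  # shape (5, 1) rather than (5,)
y_data = df['A'].values
reg.fit(x_data, y_data)

# equivalent without reshaping: select a one-column DataFrame
reg.fit(df[['B']], df['A'])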

This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.

No it doesn't, just convert to a NumPy array:

>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
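
And a short sketch for scoring new observations with the fitted model (the B and C values in X_new are made up for illustration): lr.predict takes rows shaped like X.

>>> import numpy as np
>>> X_new = np.array([[25, 100], [45, 1000]])  # hypothetical (B, C) pairs
>>> lr.predict(X_new)  # returns the predicted A for each row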
answered Nov 16, 2013 at 14:14 by Fred Foo

5 Comments

I had to do np.matrix(np.asarray(df)), because sklearn expected a vertical vector, whereas numpy arrays, once you slice them off an array, act like horizontal vectors, which is great most of the time.
No simple way to do tests of the coefficients with this route, however.
Isn't there a way to directly feed scikit-learn with a Pandas DataFrame?
For other sklearn modules (decision tree, etc.), I've used df['colname'].values, but that didn't work for this.
You could also use the .values attribute. I.e., reg.fit(df[['B', 'C']].values, df['A'].values).

Statsmodels can build an OLS model with column references directly to a pandas dataframe.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to directly get R-squared, Coefficients and p-value:

# commands:
model.params
model.pvalues
model.rsquared

# demo:
In[1]: model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In[2]: model.pvalues
Out[2]:
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In[3]: model.rsquared
Out[3]:
0.0190
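
A sketch for predicting from this fitted model on new rows (df_new is hypothetical): because the model was fit with an added constant, new data needs the same constant column, which has_constant='add' guarantees.

df_new = pd.DataFrame({'B': [10, 50], 'C': [20, 80]})  # hypothetical new rows
predictions = model.predict(sm.add_constant(df_new, has_constant='add'))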
answered Feb 14, 2019 at 7:40 by vestland



B is not statistically significant; the data is not capable of drawing inferences from it. C does influence B probabilities.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# replace the outliers in C with the mean of the remaining values
avg_c = df['C'].mean()
sumC = df['C'].apply(lambda x: x if x < avg_c else 0).sum()
countC = df['C'].apply(lambda x: 1 if x < avg_c else None).count()
avg_c2 = sumC / countC
df['C'] = df['C'].apply(lambda x: avg_c2 if x > avg_c else x)
print(df)

# fit the regression from a formula
model_ols = smf.ols("A ~ B + C", data=df).fit()
print(model_ols.summary())

df[['B', 'C']].plot()
plt.show()

# predict A over a range of B values at two fixed levels of C
df2 = pd.DataFrame()
df2['B'] = np.linspace(10, 50, 10)
df2['C'] = 30

df3 = pd.DataFrame()
df3['B'] = np.linspace(10, 50, 10)
df3['C'] = 100

predB = model_ols.predict(df2)
predC = model_ols.predict(df3)

plt.plot(df2['B'], predB, label='predict B C=30')
plt.plot(df3['B'], predC, label='predict B C=100')
plt.legend()
plt.show()

print("A change in the probability of C affects the probability of B")

intercept = model_ols.params.loc['Intercept']
B_slope = model_ols.params.loc['B']
C_slope = model_ols.params.loc['C']
# Intercept    11.874252
# B             0.760859
# C            -0.060257
print("Intercept {}\n B slope {}\n C slope {}\n".format(intercept, B_slope, C_slope))

# lower_conf, upper_conf = np.exp(model_ols.conf_int())
# print(lower_conf, upper_conf)
# print((1 - (lower_conf / upper_conf)) * 100)

# standard errors from the covariance of the parameter estimates
model_cov = model_ols.cov_params()
std_errorB = np.sqrt(model_cov.loc['B', 'B'])
std_errorC = np.sqrt(model_cov.loc['C', 'C'])
print('Standard Error: ', round(std_errorB, 4), round(std_errorC, 4))

# check for statistical significance (z > 2 is statistically significant)
print("B z value {} C z value {}".format((B_slope / std_errorB), (C_slope / std_errorC)))
print("B feature is more statistically significant than C")

Output:

A change in the probability of C affects the probability of B
Intercept 11.874251554067563
B slope 0.7608594144571961
C slope -0.060256845997223814
Standard Error:  0.4519 0.0793
B z value 1.683510336937001 C z value -0.7601036314930376
B feature is more statistically significant than C
z > 2 is statistically significant
answered Feb 12, 2021 at 18:31 by ListenSoftware Louise Ai Agent

