
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?

asked Nov 15, 2013 at 0:47 by Michael

6 Answers


I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
answered Nov 15, 2013 at 1:05 by DSM

5 Comments

Note that the correct keyword is formula; I accidentally typed formulas instead and got a weird error: TypeError: from_formula() takes at least 3 arguments (2 given)
@DSM Very new to Python. Tried running your same code and got errors on both print statements: print result.summary() gives SyntaxError: invalid syntax, and print result.params gives SyntaxError: Missing parentheses in call to 'print'... Maybe I loaded packages wrong?? It appears to work when I don't put "print". Thanks.
@a.powell The OP's code is for Python 2. The only change I think you need to make is to put parentheses round the arguments to print: print(result.params) and print(result.summary())
Attempting to use this formula() approach throws the type error TypeError: __init__() missing 1 required positional argument: 'endog', so I guess it's deprecated. Also, ols is now OLS.
As others mention, sm.ols has been deprecated in favor of sm.OLS. The default behavior is also different. To run a regression from a formula as done here, you need to do: result = sm.OLS.from_formula(formula="A ~ B + C", data=df).fit()
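
For reference, here is a minimal sketch of the same regression with the current statsmodels formula interface (assuming a recent statsmodels; the smf alias and variable names just mirror the answer above), which is equivalent to the sm.OLS.from_formula call mentioned in the comment:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# smf.ols accepts an R-style formula plus the dataframe, as in the original answer
result = smf.ols(formula="A ~ B + C", data=df).fit()
print(result.params)     # Intercept, B and C coefficients
print(result.summary())  # the full regression table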

Note: pandas.stats has been removed as of pandas 0.20.0


It's possible to do this with pandas.stats.ols:

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations:         5
Number of Degrees of Freedom:   3
R-squared:         0.5789
Adj R-squared:     0.1577
Rmse:             14.5108
F-stat (2, 2):     1.3746, p-value:     0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
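
Since pandas.stats.ols is now gone, here is a sketch of roughly the same fit done with statsmodels directly (sm.add_constant stands in for the intercept that pandas.stats.ols added automatically; variable names are just illustrative):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# add_constant appends the intercept column that pandas.stats.ols included for you
res = sm.OLS(df['A'], sm.add_constant(df[['B', 'C']])).fit()
print(res.summary())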

answered Nov 15, 2013 at 8:00 by roman

4 Comments

Note that this is going to be deprecated in a future version of pandas!
The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://www.statsmodels.org/stable/regression.html
@DestaHaileselassieHagos: This may be due to an issue with missing intercepts. The designer of the equivalent R package adjusts by removing the adjustment for the mean: stats.stackexchange.com/a/36068/64552. Other suggestions: you can use sm.add_constant to add an intercept to the exog array, and use a dict: reg = ols("y ~ x", data=dict(y=y, x=x)).fit()
There is a strange data item for C: 42523. It is an outlier. It should probably be removed or imputed with the average of the remaining values.

I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a numpy array or any other data type.

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
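
A small usage sketch continuing from reg above: the fitted model predicts straight from dataframe columns as well.

print(reg.intercept_)               # ~14.95, matching the statsmodels result above
print(reg.predict(df[['B', 'C']]))  # fitted values for the five training rows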
answered Jan 7, 2017 at 2:51 by 3novak

1 Comment

Small diversion from the OP - but I found this particular answer very helpful, after appending .values.reshape(-1, 1) to the dataframe columns. For example: x_data = df['x_data'].values.reshape(-1, 1) and passing the x_data (and a similarly created y_data) np arrays into the .fit() method.
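
To illustrate what that comment describes, a sketch of the single-column case, reusing df and reg from this answer (the x_data and y_data names are just illustrative): scikit-learn expects a 2-D X, so a lone column must be reshaped, or selected as a one-column frame.

x_data = df['B'].values.reshape(-1, 1)  # shape (5, 1) rather than (5,)
y_data = df['A'].values
reg.fit(x_data, y_data)

# equivalent without reshaping: select a one-column DataFrame
reg.fit(df[['B']], df['A'])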

This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.

No it doesn't, just convert to a NumPy array:

>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
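
And a short sketch for scoring new observations with the fitted model (the B and C values in X_new are made up for illustration): lr.predict takes rows shaped like X.

>>> import numpy as np
>>> X_new = np.array([[25, 100], [45, 1000]])  # hypothetical (B, C) pairs
>>> lr.predict(X_new)  # returns the predicted A for each row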
answered Nov 16, 2013 at 14:14 by Fred Foo

5 Comments

I had to do np.matrix(np.asarray(df)), because sklearn expected a vertical vector, whereas numpy arrays, once you slice them off an array, act like horizontal vectors, which is great most of the time.
No simple way to do tests of the coefficients with this route, however.
Isn't there a way to directly feed scikit-learn with a Pandas DataFrame?
For other sklearn modules (decision tree, etc.), I've used df['colname'].values, but that didn't work for this.
You could also use the .values attribute. I.e., reg.fit(df[['B', 'C']].values, df['A'].values).

Statsmodels can build an OLS model with column references directly to a pandas dataframe.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to directly get R-squared, Coefficients and p-value:

# commands:
model.params
model.pvalues
model.rsquared

# demo:
In[1]: model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In[2]: model.pvalues
Out[2]:
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In[3]: model.rsquared
Out[3]:
0.0190
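
A sketch for predicting from this fitted model on new rows (df_new is hypothetical): because the model was fit with an added constant, new data needs the same constant column, which has_constant='add' guarantees.

df_new = pd.DataFrame({'B': [10, 50], 'C': [20, 80]})  # hypothetical new rows
predictions = model.predict(sm.add_constant(df_new, has_constant='add'))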
answered Feb 14, 2019 at 7:40 by vestland



B is not statistically significant; the data is not capable of drawing inferences from it. C does influence B probabilities.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.DataFrame({"A": [10,20,30,40,50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# replace the outliers in C with the mean of the remaining values
avg_c = df['C'].mean()
sumC = df['C'].apply(lambda x: x if x < avg_c else 0).sum()
countC = df['C'].apply(lambda x: 1 if x < avg_c else None).count()
avg_c2 = sumC / countC
df['C'] = df['C'].apply(lambda x: avg_c2 if x > avg_c else x)
print(df)

# fit the regression from a formula
model_ols = smf.ols("A ~ B + C", data=df).fit()
print(model_ols.summary())

df[['B', 'C']].plot()
plt.show()

# predict A over a range of B values at two fixed levels of C
df2 = pd.DataFrame()
df2['B'] = np.linspace(10, 50, 10)
df2['C'] = 30

df3 = pd.DataFrame()
df3['B'] = np.linspace(10, 50, 10)
df3['C'] = 100

predB = model_ols.predict(df2)
predC = model_ols.predict(df3)

plt.plot(df2['B'], predB, label='predict B C=30')
plt.plot(df3['B'], predC, label='predict B C=100')
plt.legend()
plt.show()

print("A change in the probability of C affects the probability of B")

intercept = model_ols.params.loc['Intercept']
B_slope = model_ols.params.loc['B']
C_slope = model_ols.params.loc['C']
# Intercept    11.874252
# B             0.760859
# C            -0.060257
print("Intercept {}\n B slope {}\n C slope {}\n".format(intercept, B_slope, C_slope))

# lower_conf, upper_conf = np.exp(model_ols.conf_int())
# print(lower_conf, upper_conf)
# print((1 - (lower_conf / upper_conf)) * 100)

# standard errors from the covariance of the parameter estimates
model_cov = model_ols.cov_params()
std_errorB = np.sqrt(model_cov.loc['B', 'B'])
std_errorC = np.sqrt(model_cov.loc['C', 'C'])
print('Standard Error: ', round(std_errorB, 4), round(std_errorC, 4))

# check for statistical significance (z > 2 is statistically significant)
print("B z value {} C z value {}".format((B_slope / std_errorB), (C_slope / std_errorC)))
print("B feature is more statistically significant than C")

Output:

A change in the probability of C affects the probability of B
Intercept 11.874251554067563
B slope 0.7608594144571961
C slope -0.060256845997223814
Standard Error:  0.4519 0.0793
B z value 1.683510336937001 C z value -0.7601036314930376
B feature is more statistically significant than C
z > 2 is statistically significant
answered Feb 12, 2021 at 18:31 by ListenSoftware Louise Ai Agent

