I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
import pandas as pd
df = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?
6 Answers
I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
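As a small follow-up (my own addition, not part of the original answer): the fitted result can also be used to predict A for new values of B and C. This is a minimal sketch assuming the same formula-based fit as above; the new_data values are made up for illustration.

# Sketch: predicting from the fitted formula model (assumes `result` and `pd` from above).
new_data = pd.DataFrame({"B": [25, 45], "C": [100, 300]})
predictions = result.predict(new_data)  # the formula handles the intercept automatically
print(predictions)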
Comments:
- It is formula; I accidentally typed formulas instead and got a weird error: TypeError: from_formula() takes at least 3 arguments (2 given).
- Use print(result.params) and print(result.summary()) to see the output.
- The formula() approach throws the type error TypeError: __init__() missing 1 required positional argument: 'endog', so I guess it's deprecated. Also, ols is now OLS: result = sm.OLS.from_formula(formula="A ~ B + C", data=df).fit()
- Note: pandas.stats has been removed with 0.20.0.
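Given the API changes mentioned in the comments above, here is a minimal sketch (my own addition, assuming a reasonably recent statsmodels release) of two equivalent ways to fit the same formula today:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})

# Route 1: the formula namespace exposes a lowercase ols() that accepts a formula string.
result1 = smf.ols("A ~ B + C", data=df).fit()

# Route 2: the main namespace uses the OLS class together with from_formula().
result2 = sm.OLS.from_formula("A ~ B + C", data=df).fit()

print(result1.params)
print(result2.params)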
It's possible to do this with pandas.stats.ols:
>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations:         5
Number of Degrees of Freedom:   3
R-squared:         0.5789
Adj R-squared:     0.1577
Rmse:             14.5108
F-stat (2, 2):     1.3746, p-value:     0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.
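Since pandas.stats.ols was removed in pandas 0.20.0 (see the comments below), a roughly equivalent fit with statsmodels directly would look like the following sketch. This is my own addition, not part of the original answer:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})

# pandas.stats.ols added an intercept by default, so add a constant column explicitly here.
X = sm.add_constant(df[['B', 'C']])
res = sm.OLS(df['A'], X).fit()
print(res.summary())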
Comments:
- The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels; see some examples here: http://www.statsmodels.org/stable/regression.html
- This is missing intercepts. The designer of the equivalent R package adjusts by removing the adjustment for the mean: stats.stackexchange.com/a/36068/64552
- Other suggestions: you can use sm.add_constant to add an intercept to the exog array, and use a dict: reg = ols("y ~ x", data=dict(y=y, x=x)).fit()

I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting the data frame to a numpy array or any other data types.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
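As a small follow-up of my own (not in the original answer), the fitted estimator can then predict A directly from a data frame with the same feature columns:

# Sketch: assumes `reg` and `df` from the answer above.
predicted_A = reg.predict(df[['B', 'C']])   # predictions for the training rows
print(reg.intercept_, predicted_A)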
Comment:
- You may need to apply .values.reshape(-1, 1) to the dataframe columns. For example: x_data = df['x_data'].values.reshape(-1, 1), and pass the x_data (and a similarly created y_data) np arrays into the .fit() method.
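To illustrate the reshape point from the comment, here is a minimal sketch of my own, using a hypothetical single-feature fit:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [20, 30, 10, 40, 50]})

# A single column gives a 1-D array, but scikit-learn expects a 2-D array of shape
# (n_samples, n_features), so reshape(-1, 1) turns it into a one-feature column vector.
X = df['B'].values.reshape(-1, 1)
y = df['A'].values

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)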
"This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place."

No it doesn't, just convert to a NumPy array:
>>> import numpy as np
>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
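One assumption in the slicing above is that column A comes first in the array. If you would rather select columns by name, a sketch of my own that does the same thing is:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})

# Select features and target by column name instead of by position in the array.
X = df[['B', 'C']].to_numpy()   # .values also works on older pandas versions
y = df['A'].to_numpy()

lr = LinearRegression().fit(X, y)
print(lr.coef_, lr.intercept_)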
Comments:
- I had to use np.matrix( np.asarray( df ) ), because sklearn expected a vertical vector, whereas numpy arrays, once you slice them off an array, act like horizontal vectors, which is great most of the time.
- You can also pass the underlying arrays with the .values attribute, i.e. reg.fit(df[['B', 'C']].values, df['A'].values).

Statsmodels can build an OLS model with column references directly to a pandas dataframe.
Short and sweet:
model = sm.OLS(df[y], df[x]).fit()
Code details and regression summary:
# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to directly get R-squared, Coefficients and p-value:
# commands:
model.params
model.pvalues
model.rsquared

# demo:
In [1]: model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In [2]: model.pvalues
Out[2]:
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In [3]: model.rsquared
Out[3]: 0.019
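A few more pieces of the fitted results object can be pulled out the same way; this is a small sketch of my own, assuming the fitted `model` from above:

model.conf_int()        # 95% confidence intervals for each coefficient
model.fittedvalues      # in-sample predictions for the training data
model.resid             # residuals (observed minus fitted)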
B is not statistically significant; the data is not capable of drawing inferences from it. C does influence the B probabilities:
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]}) avg_c=df['C'].mean() sumC=df['C'].apply(lambda x: x if x<avg_c else 0).sum() countC=df['C'].apply(lambda x: 1 if x<avg_c else None).count() avg_c2=sumC/countC df['C']=df['C'].apply(lambda x: avg_c2 if x >avg_c else x) print(df) model_ols = smf.ols("A ~ B+C",data=df).fit() print(model_ols.summary()) df[['B','C']].plot() plt.show() df2=pd.DataFrame() df2['B']=np.linspace(10,50,10) df2['C']=30 df3=pd.DataFrame() df3['B']=np.linspace(10,50,10) df3['C']=100 predB=model_ols.predict(df2) predC=model_ols.predict(df3) plt.plot(df2['B'],predB,label='predict B C=30') plt.plot(df3['B'],predC,label='predict B C=100') plt.legend() plt.show() print("A change in the probability of C affects the probability of B") intercept=model_ols.params.loc['Intercept'] B_slope=model_ols.params.loc['B'] C_slope=model_ols.params.loc['C'] #Intercept 11.874252 #B 0.760859 #C -0.060257 print("Intercept {}\n B slope{}\n C slope{}\n".format(intercept,B_slope,C_slope)) #lower_conf,upper_conf=np.exp(model_ols.conf_int()) #print(lower_conf,upper_conf) #print((1-(lower_conf/upper_conf))*100) model_cov=model_ols.cov_params() std_errorB = np.sqrt(model_cov.loc['B', 'B']) std_errorC = np.sqrt(model_cov.loc['C', 'C']) print('SE: ', round(std_errorB, 4),round(std_errorC, 4)) #check for statistically significant print("B z value {} C z value {}".format((B_slope/std_errorB),(C_slope/std_errorC))) print("B feature is more statistically significant than C") Output: A change in the probability of C affects the probability of B Intercept 11.874251554067563 B slope0.7608594144571961 C slope-0.060256845997223814 Standard Error: 0.4519 0.0793 B z value 1.683510336937001 C z value -0.7601036314930376 B feature is more statistically significant than C z>2 is statistically significantComments