Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)  # prepend a column of ones for the intercept
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)  # add Gaussian noise
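The noise draws above are not seeded, so re-running the notebook gives slightly different numbers than those shown below. For reproducible simulated data you could seed NumPy's random number generator before drawing the noise (the seed value here is arbitrary):

np.random.seed(12345)  # arbitrary seed; any fixed value makes the simulation reproducible
y = y_true + sig * np.random.normal(size=nsample)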
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.984
Model: OLS Adj. R-squared: 0.983
Method: Least Squares F-statistic: 933.7
Date: Wed, 26 Jun 2024 Prob (F-statistic): 3.38e-41
Time: 19:11:00 Log-Likelihood: 0.62646
No. Observations: 50 AIC: 6.747
Df Residuals: 46 BIC: 14.40
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0126 0.085 59.031 0.000 4.842 5.183
x1 0.5065 0.013 38.673 0.000 0.480 0.533
x2 0.5542 0.051 10.765 0.000 0.451 0.658
x3 -0.0203 0.001 -17.689 0.000 -0.023 -0.018
==============================================================================
Omnibus: 2.440 Durbin-Watson: 1.986
Prob(Omnibus): 0.295 Jarque-Bera (JB): 2.057
Skew: 0.371 Prob(JB): 0.358
Kurtosis: 2.338 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
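The printed summary is convenient for reading, but the same quantities are also available programmatically as attributes and methods of the results object, for example:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.conf_int())  # 95% confidence intervals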
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50408035 5.01039129 5.47378562 5.86406125 6.16191588 6.36211827
6.47436778 6.52170074 6.53670567 6.55616908 6.61503155 6.74064706
6.94828888 7.23864013 7.5976816 7.99899506 8.40810369 8.78813652
9.10588624 9.33726505 9.47126223 9.51175318 9.47686268 9.39598705
9.30496442 9.24018661 9.2326201 9.30271828 9.45706103 9.68727366
9.97140289 10.27752362 10.56898425 10.81042832 10.97360357 11.04200365
11.01358009 10.90107883 10.72994615 10.53414787 10.35058888 10.21304978
10.1466368 10.16365694 10.26159595 10.42353143 10.6209131 10.81825359
10.97895884 11.07134045]
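On the estimation sample, predict(X) agrees with the fitted values stored on the results object, so a quick consistency check is:

np.allclose(ypred, olsres.fittedvalues)  # True: in-sample predictions equal the fitted values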
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.06088516 10.90500894 10.62544446 10.27171775 9.90902247 9.60225825
9.40014101 9.32327555 9.35911084 9.46501291]
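Point predictions alone carry no measure of uncertainty. If standard errors and intervals for the new observations are also needed, get_prediction provides them; a minimal sketch:

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # predicted mean, its standard error, and confidence/prediction intervals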
Plot comparison¶
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f42f4cd6f50>
[Figure: data points, true curve, and the OLS prediction over the in-sample and out-of-sample range]
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I() function to indicate the identity transform, i.e., we do not want any formula expansion magic from using **2.
[9]:
res.params
[9]:
Intercept 5.012557
x1 0.506457
np.sin(x1) 0.554176
I((x1 - 5) ** 2) -0.020339
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically:
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.060885
1 10.905009
2 10.625444
3 10.271718
4 9.909022
5 9.602258
6 9.400141
7 9.323276
8 9.359111
9 9.465013
dtype: float64
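get_prediction works with the formula interface as well; assuming the new data is passed the same way as to predict, interval estimates for the out-of-sample points can be obtained with:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))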