Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
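
The noise draws above are not seeded, so the exact numbers in the output below will differ slightly between runs. A minimal sketch of making the example reproducible (the seed value here is arbitrary and not part of the original example):

np.random.seed(12345)  # arbitrary seed, for reproducibility only
y = y_true + sig * np.random.normal(size=nsample)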

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     933.7
Date:                Wed, 26 Jun 2024   Prob (F-statistic):           3.38e-41
Time:                        19:11:00   Log-Likelihood:                0.62646
No. Observations:                  50   AIC:                             6.747
Df Residuals:                      46   BIC:                             14.40
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0126      0.085     59.031      0.000       4.842       5.183
x1             0.5065      0.013     38.673      0.000       0.480       0.533
x2             0.5542      0.051     10.765      0.000       0.451       0.658
x3            -0.0203      0.001    -17.689      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        2.440   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.295   Jarque-Bera (JB):                2.057
Skew:                           0.371   Prob(JB):                        0.358
Kurtosis:                       2.338   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
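
The quantities in the printed summary are also available as attributes of the results object; a small sketch using standard statsmodels results attributes:

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit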

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50408035  5.01039129  5.47378562  5.86406125  6.16191588  6.36211827
  6.47436778  6.52170074  6.53670567  6.55616908  6.61503155  6.74064706
  6.94828888  7.23864013  7.5976816   7.99899506  8.40810369  8.78813652
  9.10588624  9.33726505  9.47126223  9.51175318  9.47686268  9.39598705
  9.30496442  9.24018661  9.2326201   9.30271828  9.45706103  9.68727366
  9.97140289 10.27752362 10.56898425 10.81042832 10.97360357 11.04200365
 11.01358009 10.90107883 10.72994615 10.53414787 10.35058888 10.21304978
 10.1466368  10.16365694 10.26159595 10.42353143 10.6209131  10.81825359
 10.97895884 11.07134045]

Create a new sample of explanatory variables Xnew, predict, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.06088516 10.90500894 10.62544446 10.27171775  9.90902247  9.60225825
  9.40014101  9.32327555  9.35911084  9.46501291]
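
predict returns point predictions only. If confidence or prediction intervals are needed as well, get_prediction can be used on the same results object; a minimal sketch (alpha=0.05 gives 95% intervals):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_* and obs_ci_* columns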

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f42f4cd6f50>
[Figure: observed data, true curve, and the OLS in-sample and out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any formula expansion magic applied to **2; the expression inside I() is evaluated as plain arithmetic.

[9]:
res.params
[9]:
Intercept           5.012557
x1                  0.506457
np.sin(x1)          0.554176
I((x1 - 5) ** 2)   -0.020339
dtype: float64
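
As a sanity check, the design matrix built from the formula should match the one assembled by hand earlier; a quick sketch comparing the two:

# columns are intercept, x1, sin(x1), (x1 - 5)**2 in both cases
print(np.allclose(res.model.exog, X))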

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.060885
1    10.905009
2    10.625444
3    10.271718
4     9.909022
5     9.602258
6     9.400141
7     9.323276
8     9.359111
9     9.465013
dtype: float64
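
The formula transform is applied by get_prediction as well, so out-of-sample prediction intervals again only require the new x1 values; a short sketch:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))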