Robust Linear Models¶
[1]:
%matplotlib inline
[2]:
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
Estimation¶
Load data:
[3]:
data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)
Huber’s T norm with the (default) median absolute deviation scaling
[4]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(
hub_results.summary(
yname="y", xname=["var_%d" % i for i in range(len(hub_results.params))]
)
)
const -41.026498
AIRFLOW 0.829384
WATERTEMP 0.926066
ACIDCONC -0.127847
dtype: float64
const 9.791899
AIRFLOW 0.111005
WATERTEMP 0.302930
ACIDCONC 0.128650
dtype: float64
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 21
Model: RLM Df Residuals: 17
Method: IRLS Df Model: 3
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Wed, 26 Jun 2024
Time: 19:11:00
No. Iterations: 19
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
var_0 -41.0265 9.792 -4.190 0.000 -60.218 -21.835
var_1 0.8294 0.111 7.472 0.000 0.612 1.047
var_2 0.9261 0.303 3.057 0.002 0.332 1.520
var_3 -0.1278 0.129 -0.994 0.320 -0.380 0.124
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
Huber’s T norm with ‘H2’ covariance matrix
[5]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
const -41.026498
AIRFLOW 0.829384
WATERTEMP 0.926066
ACIDCONC -0.127847
dtype: float64
const 9.089504
AIRFLOW 0.119460
WATERTEMP 0.322355
ACIDCONC 0.117963
dtype: float64
Andrew’s Wave norm with Huber’s Proposal 2 scaling and ‘H3’ covariance matrix
[6]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print("Parameters: ", andrew_results.params)
Parameters: const -40.881796
AIRFLOW 0.792761
WATERTEMP 1.048576
ACIDCONC -0.133609
dtype: float64
See help(sm.RLM.fit)
for more options and module sm.robust.scale
for scale options
Comparing OLS and RLM¶
Artificial data with outliers:
[7]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1 - 5) ** 2))
X = sm.add_constant(X)
sig = 0.3 # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig * 1.0 * np.random.normal(size=nsample)
y2[[39, 41, 43, 45, 48]] -= 5 # add some outliers (10% of nsample)
Example 1: quadratic function with linear truth¶
Note that the quadratic term in OLS regression will capture outlier effects.
[8]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 5.28064678 0.49938233 -0.01264708]
[0.44124293 0.06812195 0.00602774]
[ 4.96446988 5.21781315 5.46694249 5.71185789 5.95255936 6.18904689
6.42132048 6.64938014 6.87322586 7.09285764 7.30827549 7.5194794
7.72646937 7.92924541 8.12780752 8.32215568 8.51228991 8.69821021
8.87991656 9.05740898 9.23068747 9.39975202 9.56460263 9.72523931
9.88166205 10.03387085 10.18186572 10.32564665 10.46521365 10.6005667
10.73170583 10.85863101 10.98134226 11.09983958 11.21412296 11.3241924
11.4300479 11.53168947 11.6291171 11.7223308 11.81133056 11.89611638
11.97668827 12.05304622 12.12519024 12.19312032 12.25683646 12.31633867
12.37162694 12.42270127]
Estimate RLM:
[9]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[ 5.19819862e+00 4.88533246e-01 -2.85699589e-03]
[0.12195747 0.01882859 0.00166604]
Draw a plot to compare OLS estimates to the robust estimates:
[10]:
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, "o", label="data")
ax.plot(x1, y_true2, "b-", label="True")
pred_ols = res.get_prediction()
iv_l = pred_ols.summary_frame()["obs_ci_lower"]
iv_u = pred_ols.summary_frame()["obs_ci_upper"]
ax.plot(x1, res.fittedvalues, "r-", label="OLS")
ax.plot(x1, iv_u, "r--")
ax.plot(x1, iv_l, "r--")
ax.plot(x1, resrlm.fittedvalues, "g.-", label="RLM")
ax.legend(loc="best")
[10]:
<matplotlib.legend.Legend at 0x7f405df626d0>

Example 2: linear function with linear truth¶
Fit a new OLS model using only the linear term and the constant:
[11]:
X2 = X[:, [0, 1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[5.79040138 0.37291157]
[0.38116183 0.03284244]
Estimate RLM:
[12]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[5.30237303 0.46222897]
[0.09709979 0.00836651]
Draw a plot to compare OLS estimates to the robust estimates:
[13]:
pred_ols = res2.get_prediction()
iv_l = pred_ols.summary_frame()["obs_ci_lower"]
iv_u = pred_ols.summary_frame()["obs_ci_upper"]
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x1, y2, "o", label="data")
ax.plot(x1, y_true2, "b-", label="True")
ax.plot(x1, res2.fittedvalues, "r-", label="OLS")
ax.plot(x1, iv_u, "r--")
ax.plot(x1, iv_l, "r--")
ax.plot(x1, resrlm2.fittedvalues, "g.-", label="RLM")
legend = ax.legend(loc="best")
