Why have shrinkage methods become state-of-the-art?

Introduction: People told me that stepwise regression (forward selection / backward elimination / bidirectional elimination) is a flawed approach, and that LASSO / ridge / a combination of the two are the best automatic methods. But I don't understand why.

Answer:

Auto-insurance pricing is one of the most sophisticated tasks within the pricing of insurance products. When there are fewer than 100 variables to be tested, checking them one by one and building the model step by step remains a good approach. But in a Big Data context there can be many more variables in the database, so we need some kind of automatic feature selection method. Stepwise regression is widely used among regression practitioners, yet it violates every principle of statistics, for example:

  1. It yields R-squared values that are badly biased to be high.
  2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
  3. The method yields confidence intervals for effects and predicted values that are falsely narrow; see Altman and Andersen (1989).
  4. It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem.
  5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani [1996]).
  6. It has severe problems in the presence of collinearity.
  7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
  8. Increasing the sample size does not help very much; see Derksen and Keselman (1992).
  9. It allows us to not think about the problem.
  10. It uses a lot of paper.

Source: Frank Harrell. For more detailed mathematical proofs, please refer to his book.

In my experience, stepwise regression tends to favor very granular predictors over less granular ones, which results in:

  1. Overfitting.
  2. Significant loss of information (i.e., underfitting).

Meanwhile, shrinkage methods have proved to be much more efficient and have become one of the state-of-the-art families of automatic feature selection methods in modern regression. In this post, we will explore in which respects shrinkage methods are better, and why.

In order to give everyone the same starting point, we first need to understand how a simple GLM / GAM model is built. For simplicity we take linear regression as the example here:

 Y  = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon

where \beta_0, \beta_1, ..., \beta_p are the parameters of our model, \epsilon is a mean-zero random error term that the model cannot capture, and X_1, X_2, ..., X_p are the predictors.

Suppose that we observe n data points (y_i, x_{1i}, x_{2i}, ..., x_{pi})_{i=1,...,n}. We can rewrite our model in matrix form as:

y = X \beta + \epsilon

In practice, the \beta are unknown and we need to estimate them as \hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p.

Let \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + ... + \hat{\beta}_p x_{pi} be the prediction for the i^{th} observation based on its predictors. Then the prediction error (the i^{th} residual) is e_i = y_i - \hat{y}_i. The residual sum of squares is:

RSS = e_1^2 + e_2^2 + ... + e_n^2

or, in matrix form:

RSS = \| y - X\hat{\beta} \|^2

The most common fitting method is least squares regression, in which we find the values of the parameters that minimize the RSS. The solution is:

\hat{\beta} = (X^TX)^{-1}X^T y
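
As a minimal sketch of this closed-form solution (using NumPy on simulated data; the sample size, true coefficients and noise level below are purely illustrative, not taken from any real pricing model), we can compute \hat{\beta} and the RSS directly:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: sample size, true coefficients and noise are illustrative only
    n, p = 500, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
    true_beta = np.array([1.0, 2.0, -1.0, 0.5])
    y = X @ true_beta + rng.normal(size=n)

    # Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
    # (np.linalg.solve is used instead of an explicit inverse for numerical stability)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    residuals = y - X @ beta_hat
    rss = np.sum(residuals ** 2)
    print("beta_hat:", beta_hat)
    print("RSS     :", rss)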

Fundamentally, \hat{\beta} is the BLUE (Best Linear Unbiased Estimator): its estimates are unbiased, and it has the least variance among all linear unbiased estimators. But an unbiased estimator is not necessarily better than a biased one.

Why?

It can be proved mathematically that the expected test error (MSE) at a given point x_0 can be decomposed into three components:

E[(y_0 - \hat{y}_0)^2] = Var(\hat{y}_0) + [Bias(\hat{y}_0)]^2 + Var(\epsilon)

The expected error when applying the fitted model to new, unseen data is the sum of:

  1. The squared bias of the estimate \hat{y}_0
  2. The variance of that estimate
  3. The irreducible error Var(\epsilon)

The equation above tells us that, in order to build a model that performs well on the test set, we need both low variance and low bias. The traditional BLUE (i.e. [Bias(\hat{y}_0)]^2 = 0 and Var(\hat{y}_0) is the smallest among all unbiased estimators) is not necessarily the best estimator: it gives estimates with zero bias, but their variance can be relatively high.
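
To make the decomposition concrete, here is a small Monte Carlo sketch (with an entirely hypothetical data-generating process and arbitrary sample sizes) that estimates the variance, the squared bias and the irreducible error of the least squares prediction at a fixed point x_0, and checks that their sum matches the expected test error:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data-generating process: all values below are arbitrary
    n, p, sigma = 50, 3, 1.0
    true_beta = np.array([1.0, 2.0, -1.0, 0.5])    # intercept + 3 slopes
    x0 = np.array([1.0, 0.3, -0.2, 0.8])           # fixed test point (first entry = intercept)
    f0 = x0 @ true_beta                            # true mean response at x0

    n_sim = 5000
    preds = np.empty(n_sim)
    for s in range(n_sim):                         # refit the model on many training sets
        X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
        y = X @ true_beta + rng.normal(scale=sigma, size=n)
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        preds[s] = x0 @ beta_hat

    variance = preds.var()                         # Var(y0_hat)
    bias_sq = (preds.mean() - f0) ** 2             # [Bias(y0_hat)]^2, ~0 for least squares
    y0 = f0 + rng.normal(scale=sigma, size=n_sim)  # fresh observations at x0
    mse = np.mean((y0 - preds) ** 2)               # E[(y0 - y0_hat)^2]

    print("Var + Bias^2 + Var(eps):", variance + bias_sq + sigma ** 2)
    print("Expected test error    :", mse)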

A perfect example from auto-insurance pricing is a model in which both the insured's age and the licence age are included. Similarly, in the Big Data context, external variables carrying regional information are often included and are highly correlated with each other. If we regress frequency / average cost on a log scale against the insured's age and licence age, we will find that the matrix X^T X has a determinant close to zero, whereas the covariance matrix of the estimated coefficients, under homoskedasticity, is \hat{\sigma}^2 (X^T X)^{-1}. So, as the determinant gets close to zero, the variance gets uncomfortably large.
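
The following small simulation illustrates the point (the variables age and licence_age are simulated stand-ins for the real rating factors, not actual insurance data): when two predictors are nearly collinear, the diagonal of \hat{\sigma}^2 (X^T X)^{-1} blows up.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    sigma2 = 1.0                                   # assumed (homoskedastic) error variance

    age = rng.uniform(18, 80, size=n)
    licence_age = age - 18 + rng.normal(scale=0.5, size=n)   # almost a linear function of age
    unrelated = rng.uniform(0, 60, size=n)                    # an uncorrelated predictor, for comparison

    def coef_variances(*columns):
        # Diagonal of sigma^2 (X^T X)^{-1}: the coefficient variances under homoskedasticity
        X = np.column_stack([np.ones(n), *columns])
        return sigma2 * np.diag(np.linalg.inv(X.T @ X))

    print("nearly collinear predictors:", coef_variances(age, licence_age))
    print("uncorrelated predictors    :", coef_variances(age, unrelated))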

The bias-variance trade-off is well known in modern statistics: there are situations in which accepting a little bias can produce a large reduction in variance. This is exactly the idea behind ridge regression (the LASSO follows the same philosophy, but penalizes absolute values of the coefficients rather than their squares). In order to keep (X^T X)^{-1} under control, we add a matrix \Gamma^T \Gamma to X^T X, where:

\Gamma = \sqrt{\lambda}\, \mathbb{I}

so that \Gamma is proportional to the identity matrix and \Gamma^T \Gamma = \lambda \mathbb{I}. We are in fact minimizing the following quantity:

\| y - X\beta \|^2 + \| \Gamma\beta \|^2

which is the RSS plus a term equal to \lambda times the sum of the squared parameters, where \lambda is the tuning parameter (or hyper-parameter). We are now minimizing not only the sum of squared residuals but also the second term, \lambda \sum_{j=1}^{p} \beta_j^2, known as the shrinkage penalty. The solution is:

\hat{\beta} = (X^T X + \Gamma^T \Gamma)^{-1}X^T y = (X^T X + \lambda \mathbb{I})^{-1}X^T y
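
Continuing the same hypothetical collinear design as above, a minimal sketch of this closed-form ridge solution (with \Gamma^T \Gamma = \lambda \mathbb{I} and an arbitrarily chosen \lambda) shows how the penalty stabilizes the estimates compared with ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(3)
    n, lam = 1000, 10.0                            # lambda chosen arbitrarily for illustration

    age = rng.uniform(18, 80, size=n)
    licence_age = age - 18 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([np.ones(n), age, licence_age])
    y = 0.05 * age + rng.normal(size=n)            # hypothetical response (e.g. log claim frequency)

    I = np.eye(X.shape[1])
    beta_ols   = np.linalg.solve(X.T @ X, X.T @ y)             # (X^T X)^{-1} X^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)   # (X^T X + lambda I)^{-1} X^T y

    print("OLS  :", beta_ols)
    print("ridge:", beta_ridge)

In practice the predictors are standardized before the penalty is applied and the intercept is usually left unpenalized; this sketch skips those details to keep the algebra visible.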

This equation tells us that, even in the presence of multicollinearity, the matrix that needs to be inverted no longer has a determinant close to 0, so the solution does not lead to undesirable variance in the estimated parameters. Instead of forcing the bias to 0 and then choosing, among all unbiased estimators, the one with the lowest variance, we balance bias and variance simultaneously by trying different values of \lambda and choosing the best one through cross-validation. We pay a small price by making our parameters biased, but in exchange the reduction in variance can be so large that the trade-off is worth it.
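
As a sketch of that tuning step (again on simulated data, so the grid of \lambda values and the design below are arbitrary), scikit-learn's RidgeCV and LassoCV cross-validate over a grid of \lambda values, which scikit-learn calls alpha:

    import numpy as np
    from sklearn.linear_model import RidgeCV, LassoCV

    rng = np.random.default_rng(4)

    # Hypothetical design with two highly correlated predictors, as in the earlier sketches
    n = 1000
    age = rng.uniform(18, 80, size=n)
    licence_age = age - 18 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([age, licence_age])
    y = 0.05 * age + rng.normal(size=n)

    # Cross-validate over a grid of lambda values (called "alpha" in scikit-learn)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)
    lasso = LassoCV(alphas=np.logspace(-3, 1, 25), cv=5).fit(X, y)

    print("ridge: best lambda =", ridge.alpha_, "coefficients =", ridge.coef_)
    print("LASSO: best lambda =", lasso.alpha_, "coefficients =", lasso.coef_)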

The graph below, from @gung on StackExchange, helps to visualize the bias-variance trade-off:

[Figure: the bias-variance trade-off]

Inspired by the discussion on StackExchange: http://stats.stackexchange.com/questions/20295/what-problem-do-shrinkage-methods-solve

DO Xuan Quang