# Why I Stopped Reading Coefficients Of Linear Regression

An oversimplified answer to a complicated question

I train a lot of machine learning models to make predictions, and invariably the first question I am asked is ‘how is your model making those decisions?’. Luckily for me, up until now most of the models I have trained have been variations on linear regression, widely claimed to be the most interpretable of models. But after completing a more involved problem earlier this year and thinking about it in more depth, I really started to question the effectiveness of my approach.

## Linear Regression Coefficients

I am going to assume that if you are reading this then you have at least a basic understanding of linear regression. In general, we are trying to find a linear relationship between our input variables $$\underline{x}$$ and our output variable $$y$$ of the form

$E[y] = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p$

This model is completely interpretable. We know for sure that when we increase an input variable ($$x_i$$) by 1, holding all other inputs fixed, our expected output ($$E[y]$$) will change by exactly the value of the coefficient preceding it ($$\beta_i$$).
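This one-unit interpretation is easy to check numerically. Here is a minimal sketch on simulated data (the variable names and true coefficients below are invented purely for illustration):

```r
# Simulate data where the true relationship is y = 2 + 3*x1 - 1.5*x2 + noise
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.1)

mod <- lm(y ~ x1 + x2)

# Predict at a baseline point, then at the same point with x1 increased by 1
baseline <- data.frame(x1 = 0.5, x2 = -0.2)
shifted  <- data.frame(x1 = 1.5, x2 = -0.2)
change <- predict(mod, shifted) - predict(mod, baseline)

# The change in the prediction is exactly the fitted coefficient on x1
all.equal(unname(change), unname(coef(mod)["x1"]))
```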

So what is my problem? Everything I have said above is completely true: the linear regression model is completely interpretable, and I can tell you exactly how it will behave when any of its inputs change. The problem comes when you start trying to apply your knowledge of the model to the real-world system. As the saying goes, ‘all models are wrong’, and it is this inaccuracy in representing the system that is where our problem lies, though maybe not in the way that you might first think.

## The Boston Housing Data

For this example I will be using the Boston Housing dataset. This is available to download online or comes as a part of the mlbench library, which is how I will be accessing it.

# Load the mlbench library containing the data set
library(mlbench)

# Load the Boston Housing Dataset
data("BostonHousing")

# Turn all entries into a number
data <- apply(BostonHousing, 2, as.numeric)
data <- as.data.frame(data)
## # A tibble: 506 × 14
##       crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio     b
##      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
##  1 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296    15.3  397.
##  2 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242    17.8  397.
##  3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  393.
##  4 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222    18.7  395.
##  5 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222    18.7  397.
##  6 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222    18.7  394.
##  7 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311    15.2  396.
##  8 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311    15.2  397.
##  9 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311    15.2  387.
## 10 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311    15.2  387.
## # … with 496 more rows, and 2 more variables: lstat <dbl>, medv <dbl>

If you would like to know more about each of the variables in the data set, you can use the help function

?BostonHousing

Now we are going to fit a linear model to the data

mod <- lm(medv ~ ., data = data)
mod$coefficients

##   (Intercept)          crim            zn         indus          chas
##  3.645949e+01 -1.080114e-01  4.642046e-02  2.055863e-02  2.686734e+00
##           nox            rm           age           dis           rad
## -1.776661e+01  3.809865e+00  6.922246e-04 -1.475567e+00  3.060495e-01
##           tax       ptratio             b         lstat
## -1.233459e-02 -9.527472e-01  9.311683e-03 -5.247584e-01

Here we can see that rm has a coefficient of about 3.8, which is the largest in size (ignoring the intercept). Does this mean that it is the most important predictor in the data set? The obvious answer here is: we can’t say. This is because our variables are all measured in different units, and this has a big effect on the coefficients. So before we train a linear model we should be centering and scaling our data.

# Centering and scaling the data
data <- scale(data, center = TRUE, scale = TRUE)
data <- as.data.frame(data)

## # A tibble: 506 × 14
##      crim      zn   indus   chas    nox     rm      age   dis    rad    tax
##     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl> <dbl>  <dbl>  <dbl>
##  1 -0.419  0.285  -1.29   -0.272 -0.144  0.413 -0.120   0.140 -0.982 -0.666
##  2 -0.417 -0.487  -0.593  -0.272 -0.740  0.194  0.367   0.557 -0.867 -0.986
##  3 -0.417 -0.487  -0.593  -0.272 -0.740  1.28  -0.266   0.557 -0.867 -0.986
##  4 -0.416 -0.487  -1.31   -0.272 -0.834  1.02  -0.809   1.08  -0.752 -1.11
##  5 -0.412 -0.487  -1.31   -0.272 -0.834  1.23  -0.511   1.08  -0.752 -1.11
##  6 -0.417 -0.487  -1.31   -0.272 -0.834  0.207 -0.351   1.08  -0.752 -1.11
##  7 -0.410  0.0487 -0.476  -0.272 -0.265 -0.388 -0.0702  0.838 -0.522 -0.577
##  8 -0.403  0.0487 -0.476  -0.272 -0.265 -0.160  0.978   1.02  -0.522 -0.577
##  9 -0.396  0.0487 -0.476  -0.272 -0.265 -0.930  1.12    1.09  -0.522 -0.577
## 10 -0.400  0.0487 -0.476  -0.272 -0.265 -0.399  0.615   1.33  -0.522 -0.577
## # … with 496 more rows, and 4 more variables: ptratio <dbl>, b <dbl>,
## #   lstat <dbl>, medv <dbl>

Let’s see which variable is the “most important” according to linear regression now.

mod <- lm(medv ~ ., data = data)
mod$coefficients
##   (Intercept)          crim            zn         indus          chas
## -1.627634e-16 -1.010171e-01  1.177152e-01  1.533520e-02  7.419883e-02
##           nox            rm           age           dis           rad
## -2.238480e-01  2.910565e-01  2.118638e-03 -3.378363e-01  2.897491e-01
##           tax       ptratio             b         lstat
## -2.260317e-01 -2.242712e-01  9.243223e-02 -4.074469e-01

Here we can see that lstat is now considered to be the most important. Next, we can bootstrap these coefficients to get some idea of a confidence interval on them. To perform a bootstrap, we resample our data (randomly select rows) with replacement (meaning we can select the same row more than once) until we have a new data set that is the same size as our original one. If we calculate a new set of coefficients on this new data set, they should be similar but slightly different. The amount by which the coefficients vary can then be used to infer the uncertainty associated with each coefficient. For this experiment I am going to use 25,000 resamples, stored in a variable N_SAMPLE. This code can take a long time to run, so if you are recreating this you may wish to try fewer resamples.

# Setting the seed so that results can be recreated
set.seed(123)
# Setting the number of resamples
N_SAMPLE = 25000

I am then going to create a matrix to store the value of each coefficient, I am also going to create a matrix that will store the rank of each variable according to how important it is.

coefs <- matrix(0, ncol = ncol(data), nrow = N_SAMPLE)
ranks <- matrix(0, ncol = ncol(data), nrow = N_SAMPLE)

I then perform the bootstrapping

for (i in 1:N_SAMPLE) {
  # Resample the rows of the data with replacement
  data_samples <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ]
  mod <- lm(medv ~ ., data = data_samples)
  coefs[i, ] <- mod$coefficients
  # Rank 1 goes to the largest absolute coefficient
  ranks[i, ] <- rank(-abs(mod$coefficients), ties.method = "random")
}

colnames(coefs) <- names(mod$coefficients)
colnames(ranks) <- names(mod$coefficients)

We can then plot the distribution of these coefficients
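The plot from the original post is not reproduced here, but something similar can be recreated with ggplot2 and tidyr (assuming both are installed; I use a much smaller bootstrap below so it runs quickly):

```r
library(mlbench)
library(ggplot2)
library(tidyr)

data("BostonHousing")
data <- as.data.frame(scale(apply(BostonHousing, 2, as.numeric)))

# A small bootstrap (100 resamples) just to have something to plot
set.seed(123)
coefs <- t(replicate(100, {
  idx <- sample(nrow(data), replace = TRUE)
  coef(lm(medv ~ ., data = data[idx, ]))
}))

# Long format: one row per (resample, variable) pair
coefs_long <- pivot_longer(as.data.frame(coefs), cols = everything(),
                           names_to = "variable", values_to = "coefficient")

# Density of each coefficient across resamples, with a reference line at zero
ggplot(coefs_long, aes(x = coefficient)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  facet_wrap(~ variable, scales = "free")
```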

By doing this, you can see just how wide the range of these coefficients is under perturbation. Not only this, but some variables that are quite clearly on one side of 0 (e.g. crim) have multiple coefficients that cross 0, changing the way the variable will be interpreted. This is not to be confused with the coefficients centered around 0; these have their own problems too, with many non-zero coefficients seemingly implying relationships that are not present. This might not be a problem in and of itself, but we can see that the ranges overlap quite substantially, so if we look at the rank of each of these coefficients under resampling we can see that it varies wildly too.

This shows that which variables are more important varies substantially under small perturbations of the underlying data. This alone was evidence enough for me that I should stop relying on the coefficients of linear regression, as the model can’t even consistently rank the importance of variables.
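One way to quantify this instability (a rough sketch of my own, independent of the loop above and using a smaller bootstrap) is to count how often each variable has the largest absolute coefficient:

```r
library(mlbench)

data("BostonHousing")
data <- as.data.frame(scale(apply(BostonHousing, 2, as.numeric)))

# For each resample, record which variable has the largest |coefficient|
set.seed(123)
top <- replicate(200, {
  idx <- sample(nrow(data), replace = TRUE)
  b <- coef(lm(medv ~ ., data = data[idx, ]))[-1]  # drop the intercept
  names(which.max(abs(b)))
})

# Proportion of resamples in which each variable comes out on top
sort(table(top) / length(top), decreasing = TRUE)
```

If coefficients were stable, one variable would win essentially every time; instead the top spot is shared between several variables.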

## Why is this happening?

There are multiple reasons why this happens, all linked to the resampling of the data, but the main one is correlation in the underlying data.

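The correlation plot from the original post is not reproduced here, but a base-R approximation (my choice of visualisation, not necessarily the original one) looks like this:

```r
library(mlbench)

data("BostonHousing")
data <- apply(BostonHousing, 2, as.numeric)

# Pairwise correlations between every variable in the data set
cors <- cor(data)

# For example, rad and tax are very strongly correlated
cors["rad", "tax"]

# A quick visualisation of the whole matrix
heatmap(cors, symm = TRUE)
```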

As we can see from this correlation plot, the variables covary a lot. This can make it hard for linear regression to determine which variable is contributing to a change, as they are all varying together. This isn’t necessarily a problem for prediction, but it is a big problem for inference. It means that a truly important variable might not end up with a large coefficient, while something it is correlated with does.

This correlation leads to another problem too: the way in which we interpret the system is by definition at odds with the way in which we interpret the model. By reading coefficients we are asking ‘how will the prediction change if this variable changes and nothing else does?’. But within the system we know that variables never change independently, and a lot of the time the idea of one variable changing whilst another stays the same is nonsensical.
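The point about correlated inputs can be made concrete with a small simulation (everything below is invented for illustration): two near-identical predictors share one true effect, and while their individual coefficients swing wildly from fit to fit, their sum stays stable.

```r
set.seed(42)

# Refit the same model on 200 fresh data sets with two highly correlated inputs
coef_pairs <- t(replicate(200, {
  n <- 100
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is almost a copy of x1
  y <- x1 + x2 + rnorm(n)         # the true effect is split between them
  coef(lm(y ~ x1 + x2))[c("x1", "x2")]
}))

# The individual coefficients vary a lot across fits...
apply(coef_pairs, 2, sd)

# ...but their sum (the quantity the data can actually pin down) barely moves
sd(rowSums(coef_pairs))
```

The data simply cannot distinguish the two predictors' contributions, only their combined effect, which is exactly the situation we are in with variables like rad and tax.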

## Conclusions

This is a very rough overview of some of the many issues with reading the coefficients of linear models. It is still something I might do as part of exploratory data analysis or as a quick sanity check, but it is something I am now trying to move away from. Whenever I do have to do it, I try to keep these ideas in mind and I am very careful about any inferential statements that I make.

##### Hugh Warden
###### PhD Student - Human Genetics Unit

I use machine learning and cell painting to look at how cancer affects the morphology of cells.