Multicollinearity

Prof. Eric Friedlander

First things first

  • Finish-up AE-16

Announcements

  • Project: EDA Due Today
  • Project: Paper Due November 18th
  • Oral R Quiz

📋 AE 17 - Multicollinearity

  • Open up AE 17 and Complete Exercise 0

Topics

  • Defining Multicollinearity
  • Detecting Multicollinearity
  • Variance Inflation Factors

Computational setup

# load packages
library(tidyverse)
library(broom)
library(mosaic)
library(mosaicData)
library(patchwork)
library(knitr)
library(kableExtra)
library(scales)
library(countdown)
library(rms)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

Data: rail_trail

  • The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
  • Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.
# A tibble: 90 × 7
   volume hightemp avgtemp season cloudcover precip day_type
    <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
 1    501       83    66.5 Summer       7.60 0      Weekday 
 2    419       73    61   Summer       6.30 0.290  Weekday 
 3    397       74    63   Spring       7.5  0.320  Weekday 
 4    385       95    78   Summer       2.60 0      Weekend 
 5    200       44    48   Spring      10    0.140  Weekday 
 6    375       69    61.5 Spring       6.60 0.0200 Weekday 
 7    417       66    52.5 Spring       2.40 0      Weekday 
 8    629       66    52   Spring       0    0      Weekend 
 9    533       80    67.5 Summer       3.80 0      Weekend 
10    547       79    62   Summer       4.10 0      Weekday 
# ℹ 80 more rows

Source: Pioneer Valley Planning Commission via the mosaicData package.

Full model

term               estimate  std.error   statistic    p.value
(Intercept)       17.622161  76.582860   0.2301058  0.8185826
hightemp           7.070528   2.420523   2.9210743  0.0045045
avgtemp           -2.036685   3.142113  -0.6481896  0.5186733
seasonSpring      35.914983  32.992762   1.0885716  0.2795319
seasonSummer      24.153571  52.810486   0.4573632  0.6486195
cloudcover        -7.251776   3.843071  -1.8869743  0.0627025
precip           -95.696525  42.573359  -2.2478030  0.0272735
day_typeWeekend   35.903750  22.429056   1.6007696  0.1132738

Multicollinearity

What is multicollinearity

  • Multicollinearity is the case when one or more predictor variables are strongly correlated with some combination of other predictors
  • Intuition: if you could fit a good linear model with one of your predictors as the response and the rest of the predictors as your explanatory variables, then your predictors are exhibiting multicollinearity

Example

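Let’s assume the true population regression equation is $y = 3 + 4x$

Suppose we try estimating that equation using a model with variables $x$ and $z = x/10$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 z = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 \frac{x}{10} = \hat{\beta}_0 + \left(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\right) x$$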

Example

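$$\hat{y} = \hat{\beta}_0 + \left(\hat{\beta}_1 + \frac{\hat{\beta}_2}{10}\right) x$$

  • We can set $\hat{\beta}_1$ and $\hat{\beta}_2$ to any two numbers such that $\hat{\beta}_1 + \frac{\hat{\beta}_2}{10} = 4$

  • Therefore, we are unable to choose the “best” combination of $\hat{\beta}_1$ and $\hat{\beta}_2$

  • In statistics, we say this model is “unidentifiable” because different parameter combinations can result in the same model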

  • This is also why we need to set a reference level for categorical variables

  • Complete Exercises 1-2.
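
The example above can be checked numerically. The following is a minimal sketch on simulated data (the sample size, seed, and noise level are assumptions made for illustration, not part of the original example). It shows how lm() reacts to the redundant predictor z and that two different coefficient pairs produce the same fitted line.

```r
# Simulated data from the example's population model y = 3 + 4x, plus noise.
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 3 + 4 * x + rnorm(50)
z <- x / 10  # z is an exact linear function of x

# lm() detects the rank deficiency and reports NA for the aliased term z.
fit <- lm(y ~ x + z)
coef(fit)

# Two different coefficient pairs with beta1 + beta2 / 10 = 4 give the same
# predictions, so the data cannot distinguish between them.
pred_a <- 3 + 4 * x + 0 * z
pred_b <- 3 + 0 * x + 40 * z
all.equal(pred_a, pred_b)
```

The coefficient for z is reported as NA because, once x is in the model, z carries no additional information.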

Why multicollinearity is a problem

  • When we have perfect collinearities, we are unable to get estimates for the coefficients

  • When we have almost perfect collinearities (i.e., highly correlated predictor variables), the standard errors for our regression coefficients inflate

  • In other words, we lose precision in our estimates of the regression coefficients

  • This impedes our ability to use the model for inference

  • It is also difficult to interpret the model coefficients
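
To see the standard error inflation described above, here is a small simulation sketch. All variable names, the seed, and the noise levels are hypothetical choices for illustration. The same response model is fit twice: once with a second predictor that is unrelated to x1, and once with a second predictor that is almost a copy of x1.

```r
set.seed(2)
n <- 100
x1 <- rnorm(n)
x2_indep <- rnorm(n)                  # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1
noise <- rnorm(n)

# Same true coefficients in both settings; only the predictor correlation differs.
y_indep <- 1 + 2 * x1 + 2 * x2_indep + noise
y_coll  <- 1 + 2 * x1 + 2 * x2_coll + noise

fit_indep <- lm(y_indep ~ x1 + x2_indep)
fit_coll  <- lm(y_coll ~ x1 + x2_coll)

# Standard error of the x1 coefficient in each fit.
se_indep <- summary(fit_indep)$coefficients["x1", "Std. Error"]
se_coll  <- summary(fit_coll)$coefficients["x1", "Std. Error"]
c(independent = se_indep, collinear = se_coll)
```

The standard error for x1 should come out much larger in the collinear fit, even though the sample size and noise are identical.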

Detecting Multicollinearity

Multicollinearity may occur when…

  • There are very high correlations (|r| > 0.9) among two or more predictor variables, especially when the sample size is small

  • One (or more) predictor variables is an almost perfect linear combination of the others

  • There are interactions between two or more continuous variables

Detecting multicollinearity in the EDA

  • Look at a correlation matrix of the predictor variables, including all indicator variables
    • Look out for values close to 1 or -1
  • Look at a scatter plot matrix of the predictor variables
    • Look out for plots that show a relatively linear relationship
  • Complete Exercises 3-4.
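
One possible way to run these EDA checks in base R is sketched below. The data are simulated stand-ins that borrow column names from rail_trail (they are not the real data), and the 0.9 cutoff is the rule of thumb from the previous slide.

```r
# Simulated stand-in predictors (NOT the real rail_trail data).
set.seed(212)
n <- 90
hightemp   <- rnorm(n, mean = 70, sd = 12)
avgtemp    <- 0.8 * hightemp + rnorm(n, sd = 3)
cloudcover <- runif(n, min = 0, max = 10)
day_type   <- factor(sample(c("Weekday", "Weekend"), n, replace = TRUE))
preds <- data.frame(hightemp, avgtemp, cloudcover, day_type)

# model.matrix() expands factors into indicator columns; drop the intercept.
X <- model.matrix(~ ., data = preds)[, -1]
cor_mat <- cor(X)
round(cor_mat, 2)

# List each pair of predictors with |r| > 0.9 once (upper triangle only).
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, "row"]],
           var2 = colnames(cor_mat)[high[, "col"]])
```

For the scatter plot matrix, calling pairs() on the numeric columns of the same data frame is one base R option.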

Detecting Multicollinearity (VIF)

Variance Inflation Factor (VIF): Measure of multicollinearity in the regression model

$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$

where $R^2_{X_j \mid X_{-j}}$ is the proportion of variation in $X_j$ that is explained by the linear combination of the other explanatory variables in the model.
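
The definition can be applied directly: regress each predictor on the others and plug the resulting R² into the formula. The sketch below does this by hand on simulated stand-in predictors (the data and the helper name vif_manual are assumptions for illustration; it is not the rms implementation).

```r
# Simulated stand-in predictors (NOT the real rail_trail data).
set.seed(212)
n <- 90
hightemp   <- rnorm(n, mean = 70, sd = 12)
avgtemp    <- 0.8 * hightemp + rnorm(n, sd = 3)
cloudcover <- runif(n, min = 0, max = 10)
X <- data.frame(hightemp, avgtemp, cloudcover)

# VIF for predictor j: regress X_j on the other predictors, then 1 / (1 - R^2).
vif_manual <- sapply(names(X), function(j) {
  aux_fit <- lm(reformulate(setdiff(names(X), j), response = j), data = X)
  1 / (1 - summary(aux_fit)$r.squared)
})
round(vif_manual, 2)
```

For numeric predictors, these values should match the diagonal of the inverse of the predictors' correlation matrix, which gives a quick cross-check.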

Detecting Multicollinearity (VIF)

  • Typically VIF>10 indicates concerning multicollinearity

  • Variables with similar values of VIF are typically the ones correlated with each other

  • Use the vif() function in the rms R package to calculate VIF

VIF for rail trail model

Complete Exercise 5.

vif(rt_full_fit)
       hightemp         avgtemp    seasonSpring    seasonSummer      cloudcover 
      10.259978       13.086175        2.751577        5.841985        1.587485 
         precip day_typeWeekend 
       1.295352        1.125741 


hightemp and avgtemp both have VIF > 10, which indicates that they are strongly correlated with each other.
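
To put the two large VIF values in context, the VIF formula can be inverted. This short sketch uses only the numbers printed in the output above; the variable names are illustrative.

```r
# VIF values copied from the vif(rt_full_fit) output above.
vif_vals <- c(hightemp = 10.259978, avgtemp = 13.086175)

# Rearranging VIF = 1 / (1 - R^2) gives R^2 = 1 - 1 / VIF.
r2_implied <- 1 - 1 / vif_vals

# sqrt(VIF) is the factor by which the standard error grows relative to a
# predictor that is uncorrelated with the other predictors.
se_factor <- sqrt(vif_vals)

round(rbind(r2_implied, se_factor), 3)
```

Roughly 90% of the variation in hightemp and 92% of the variation in avgtemp is explained by the other predictors, and their standard errors are more than three times larger than they would be without that overlap.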

What to do about Multicollinearity

  1. Drop some predictors.
    • Example: Remove either avgtemp or hightemp and refit the model.
  2. Combine some predictors.
    • Example: Create a new variable temp_composite that is the average of avgtemp and hightemp.
  3. Discount the individual coefficients and t-tests.
    • Example: Interpret avgtemp and hightemp together, since their individual β’s and p-values do not have much meaning.

Complete Exercises 6 & 7.
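
The first two options can be sketched as follows. The data are simulated stand-ins that reuse a few rail_trail column names (an assumption for illustration; the real models in the next slides include more predictors).

```r
# Simulated stand-in data (NOT the real rail_trail data).
set.seed(212)
n <- 90
hightemp   <- rnorm(n, mean = 70, sd = 12)
avgtemp    <- 0.8 * hightemp + rnorm(n, sd = 3)
cloudcover <- runif(n, min = 0, max = 10)
volume     <- 20 + 5 * hightemp - 8 * cloudcover + rnorm(n, sd = 90)
sim <- data.frame(volume, hightemp, avgtemp, cloudcover)

# Option 2: combine the two temperature measures into one predictor.
sim$temp_composite <- (sim$avgtemp + sim$hightemp) / 2

fit_both      <- lm(volume ~ hightemp + avgtemp + cloudcover, data = sim)
fit_drop      <- lm(volume ~ hightemp + cloudcover, data = sim)        # Option 1
fit_composite <- lm(volume ~ temp_composite + cloudcover, data = sim)  # Option 2

# Standard errors of the slope coefficients in each model.
se <- function(fit) summary(fit)$coefficients[-1, "Std. Error"]
se(fit_both)
se(fit_drop)
se(fit_composite)
```

In this simulation, the temperature coefficient has a noticeably smaller standard error once the redundant temperature variable is dropped or merged.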

Model without hightemp

term             estimate  std.error  statistic  p.value
(Intercept)        76.071     77.204      0.985    0.327
avgtemp             6.003      1.583      3.792    0.000
seasonSpring       34.555     34.454      1.003    0.319
seasonSummer       13.531     55.024      0.246    0.806
cloudcover        -12.807      3.488     -3.672    0.000
precip           -110.736     44.137     -2.509    0.014
day_typeWeekend    48.420     22.993      2.106    0.038

Model without avgtemp

term             estimate  std.error  statistic  p.value
(Intercept)         8.421     74.992      0.112    0.911
hightemp            5.696      1.164      4.895    0.000
seasonSpring       31.239     32.082      0.974    0.333
seasonSummer        9.424     47.504      0.198    0.843
cloudcover         -8.353      3.435     -2.431    0.017
precip            -98.904     42.137     -2.347    0.021
day_typeWeekend    37.062     22.280      1.663    0.100

Model with temp_composite

term             estimate  std.error  statistic  p.value
(Intercept)        18.823     77.430      0.243    0.809
seasonSpring       28.458     33.059      0.861    0.392
seasonSummer       -0.986     51.234     -0.019    0.985
cloudcover        -10.367      3.409     -3.041    0.003
precip           -104.475     42.725     -2.445    0.017
day_typeWeekend    40.914     22.479      1.820    0.072
temp_composite      6.292      1.376      4.571    0.000

Choosing a model

Model without hightemp:

adj.r.squared AIC BIC
0.42 1087.5 1107.5

Model without avgtemp:

adj.r.squared AIC BIC
0.47 1079.05 1099.05

Model with temp_composite:

adj.r.squared AIC BIC
0.46 1081.67 1101.67

Based on adjusted $R^2$ (highest), AIC (lowest), and BIC (lowest), the model without avgtemp is the best fit of the three. Therefore, we remove avgtemp and keep hightemp in the model to deal with the multicollinearity.
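
A comparison table like the ones above could be assembled with base R as sketched here. The data are simulated stand-ins (not the real rail_trail data), and the list names are hypothetical labels for the three candidate models.

```r
# Simulated stand-in data (NOT the real rail_trail data).
set.seed(212)
n <- 90
hightemp   <- rnorm(n, mean = 70, sd = 12)
avgtemp    <- 0.8 * hightemp + rnorm(n, sd = 3)
cloudcover <- runif(n, min = 0, max = 10)
volume     <- 20 + 5 * hightemp - 8 * cloudcover + rnorm(n, sd = 90)
sim <- data.frame(volume, hightemp, avgtemp, cloudcover)
sim$temp_composite <- (sim$avgtemp + sim$hightemp) / 2

# The three candidate models, each with the same number of coefficients.
fits <- list(
  without_hightemp = lm(volume ~ avgtemp + cloudcover, data = sim),
  without_avgtemp  = lm(volume ~ hightemp + cloudcover, data = sim),
  with_composite   = lm(volume ~ temp_composite + cloudcover, data = sim)
)

# Higher adjusted R^2 is better; lower AIC and BIC are better.
comparison <- data.frame(
  adj.r.squared = sapply(fits, function(f) summary(f)$adj.r.squared),
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC)
)
round(comparison, 2)
```

Because the three models have the same number of parameters k, BIC − AIC equals k(log n − 2) for each of them, so the two criteria must rank the models identically. In the tables above the gap is 20.00 for every model, which matches k = 8 (seven coefficients plus the error variance) and n = 90.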

Selected model (for now)

term             estimate  std.error  statistic  p.value
(Intercept)         8.421     74.992      0.112    0.911
hightemp            5.696      1.164      4.895    0.000
seasonSpring       31.239     32.082      0.974    0.333
seasonSummer        9.424     47.504      0.198    0.843
cloudcover         -8.353      3.435     -2.431    0.017
precip            -98.904     42.137     -2.347    0.021
day_typeWeekend    37.062     22.280      1.663    0.100

🔗 MAT 212 - Winter 2025 - Schedule
