MAT 212 - Winter 2025 – MLR: Inference and Conditions

Review: Simple linear regression (SLR)

# Scatterplot of volume against avgtemp with the fitted SLR line
gf_point(volume ~ avgtemp, data = rail_trail, alpha = 0.5) |>
  gf_lm() |>
  gf_labs(x = "avgtemp", y = "volume")

SLR model summary

avgtemp_slr_fit <- lm(volume ~ avgtemp, data = rail_trail)

tidy(avgtemp_slr_fit) |> kable()

term	estimate	std.error	statistic	p.value
(Intercept)	99.60227	63.473587	1.569192	0.1201920
avgtemp	4.80205	1.084499	4.427899	0.0000272

SLR hypothesis test

term	estimate	std.error	statistic	p.value
(Intercept)	99.60227	63.473587	1.569192	0.1201920
avgtemp	4.80205	1.084499	4.427899	0.0000272

Set hypotheses: $H_{0} : β_{1} = 0$ vs. $H_{A} : β_{1} \neq 0$

Calculate test statistic and p-value: The test statistic is $t = 4.43$. The p-value is calculated using a $t$ distribution with 88 degrees of freedom. The p-value is $2.72 \times 10^{-5}$.

State the conclusion: The p-value is small, so we reject $H_{0}$. The data provide strong evidence that avgtemp is a helpful predictor of rail trail volume, i.e. there is a linear relationship between avgtemp and volume.
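As a quick check, the p-value can be reproduced from the test statistic and the degrees of freedom alone. This is a minimal sketch in base R; the variable names are illustrative, and the numbers are copied from the summary table rather than taken from a fitted model object.

```r
# Two-sided p-value for the avgtemp slope in the SLR model.
# Values are copied from the tidy() output above.
t_stat <- 4.427899   # statistic for avgtemp
df     <- 88         # degrees of freedom stated above

# Area in both tails of the t distribution beyond |t_stat|
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value
```

The result should agree with the p.value column for avgtemp up to rounding.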

Multiple linear regression

rail_trail_fit <- lm(volume ~ avgtemp + hightemp, data = rail_trail)

tidy(rail_trail_fit) |> kable()

term	estimate	std.error	statistic	p.value
(Intercept)	1.667237	56.424299	0.0295482	0.9764951
avgtemp	-7.942489	2.346235	-3.3852060	0.0010689
hightemp	12.056606	2.041231	5.9065368	0.0000001

Multiple linear regression

The multiple linear regression model assumes $Y | X_{1}, X_{2}, \dots, X_{p} \sim N (β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{p} X_{p}, σ_{ϵ}^{2})$

For a given observation $(x_{i 1}, x_{i 2}, \dots, x_{i p}, y_{i})$, we can rewrite the previous statement as

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + \dots + β_{p} x_{i p} + ϵ_{i}, \quad ϵ_{i} \sim N (0, σ_{ϵ}^{2})$

Estimating $σ_{ϵ}$

For a given observation $(x_{i 1}, x_{i 2}, \dots, x_{i p}, y_{i})$, the residual is $\begin{aligned} e_{i} & = y_{i} - \hat{y}_{i} \\ & = y_{i} - ({\hat{β}}_{0} + {\hat{β}}_{1} x_{i 1} + {\hat{β}}_{2} x_{i 2} + \dots + {\hat{β}}_{p} x_{i p}) \end{aligned}$

The estimated value of the regression standard error, $σ_{ϵ}$, is

${\hat{σ}}_{ϵ} = \sqrt{\frac{\sum_{i = 1}^{n} e_{i}^{2}}{n - p - 1}}$

As with SLR, we use ${\hat{σ}}_{ϵ}$ to calculate $S E_{{\hat{β}}_{j}}$, the standard error of each coefficient. See Matrix Form of Linear Regression for more detail.
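The formula for ${\hat{σ}}_{ϵ}$ can be checked by hand against what R reports. The sketch below is an illustration only: it assumes the built-in `mtcars` data as a stand-in for `rail_trail` so that it runs without extra packages, and the model `mpg ~ wt + hp` is an arbitrary two-predictor example.

```r
# Illustration only: mtcars stands in for the course data so the
# example runs without extra packages.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # non-intercept coefficients
e <- residuals(fit)          # e_i = y_i - y_hat_i

# Regression standard error: sqrt(sum of squared residuals / (n - p - 1))
sigma_hat <- sqrt(sum(e^2) / (n - p - 1))

# Compare with the value R reports for the fitted model
c(by_hand = sigma_hat, from_lm = sigma(fit))
```

The two values should match, since `sigma()` uses the same residual degrees of freedom.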

MLR hypothesis test: avgtemp

Set hypotheses: $H_{0} : β_{avgtemp} = 0$ vs. $H_{A} : β_{avgtemp} \neq 0$, given hightemp is in the model

Calculate test statistic and p-value: The test statistic is $t = -3.39$. The p-value is calculated using a $t$ distribution with $n - p - 1 = 90 - 2 - 1 = 87$ degrees of freedom. The p-value is $\approx 0.0011$.
- Note that $p$ counts all non-intercept $β$ ’s

State the conclusion: The p-value is small, so we reject $H_{0}$. The data provide convincing evidence that avgtemp is a useful predictor of volume in a model that already contains hightemp.
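The test statistic and p-value for avgtemp can be rebuilt from the estimate, its standard error, and $n - p - 1$. This is a sketch in base R with illustrative variable names; the inputs are copied from the MLR summary table.

```r
# Values copied from the tidy() output for volume ~ avgtemp + hightemp
estimate  <- -7.942489
std_error <- 2.346235
n <- 90   # observations
p <- 2    # non-intercept coefficients (avgtemp, hightemp)

t_stat  <- estimate / std_error   # test statistic for H0: beta = 0
df      <- n - p - 1              # residual degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(c(t = t_stat, df = df, p_value = p_value), 4)
```

Using 88 degrees of freedom here (the SLR value) would change the p-value only slightly, but the correct count subtracts one degree of freedom for every estimated coefficient, including the intercept.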

MLR hypothesis test: interaction terms

- Framework: same as the previous slide, but use the $β$ of the interaction term
- $p$ (the number of predictors) should include the interaction term
- Conclusion: tells you whether the interaction term is a useful predictor
- Warning: if $X_{1}$ and $X_{2}$ have an interaction term, don't try to interpret their individual p-values. If the interaction term is significant, then both variables are important.
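To see how the interaction term enters the count of predictors, here is a small sketch. It is an illustration only: it assumes the built-in `mtcars` data as a stand-in for `rail_trail`, and `mpg ~ wt * hp` is an arbitrary model with one interaction.

```r
# Illustration only: mtcars stands in for the course data.
# wt * hp expands to wt + hp + wt:hp
fit_int <- lm(mpg ~ wt * hp, data = mtcars)

n  <- nrow(mtcars)
p  <- length(coef(fit_int)) - 1   # counts wt, hp, and wt:hp
df <- n - p - 1

# Row for the interaction term: estimate, SE, t value, p-value
coefs   <- summary(fit_int)$coefficients
int_row <- coefs["wt:hp", ]

# Recompute the p-value by hand using n - p - 1 degrees of freedom
p_by_hand <- 2 * pt(abs(int_row["t value"]), df = df, lower.tail = FALSE)
c(df = df,
  from_summary = unname(int_row["Pr(>|t|)"]),
  by_hand = unname(p_by_hand))
```

The hand-computed p-value matches the summary only when the interaction term is included in $p$.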

Complete Exercises 3 and 4.

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	1.67	56.42	0.03	0.98	-110.48	113.82
avgtemp	-7.94	2.35	-3.39	0.00	-12.61	-3.28
hightemp	12.06	2.04	5.91	0.00	8.00	16.11
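The conf.low and conf.high columns are consistent with intervals of the form estimate $\pm\ t^{*} \times SE$. The sketch below assumes a 95% confidence level (the level is not stated next to the table) and uses values copied from the earlier, unrounded MLR summary; the variable names are illustrative.

```r
# Values copied from the tidy() output for volume ~ avgtemp + hightemp
estimate  <- c(avgtemp = -7.942489, hightemp = 12.056606)
std_error <- c(avgtemp = 2.346235, hightemp = 2.041231)
df <- 90 - 2 - 1   # n - p - 1

# Critical value, assuming 95% confidence
t_star <- qt(0.975, df = df)

ci <- cbind(conf.low  = estimate - t_star * std_error,
            conf.high = estimate + t_star * std_error)
round(ci, 2)
```

The rounded bounds should agree with the table rows for avgtemp and hightemp.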

term	estimate	std.error	statistic	p.value
(Intercept)	17.622161	76.582860	0.2301058	0.8185826
hightemp	7.070528	2.420523	2.9210743	0.0045045
avgtemp	-2.036685	3.142113	-0.6481896	0.5186733
seasonSpring	35.914983	32.992762	1.0885716	0.2795319
seasonSummer	24.153571	52.810486	0.4573632	0.6486195
cloudcover	-7.251776	3.843071	-1.8869743	0.0627025
precip	-95.696525	42.573359	-2.2478030	0.0272735
day_typeWeekend	35.903750	22.429056	1.6007696	0.1132738
