Use simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.
Estimate the slope and intercept of the regression line using the least squares method.
Interpret the slope and intercept of the regression line.
Use R to fit and summarize regression models.
Computational setup
# load packages
library(tidyverse) # for data wrangling
library(ggformula) # for plotting
library(broom)     # for formatting model output
library(knitr)     # for formatting tables

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_bw(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 8,
  fig.asp = 0.618,
  fig.retina = 3,
  dpi = 300,
  out.width = "80%"
)
Data
DC Bikeshare
Our data set contains daily rentals from the Capital Bikeshare in Washington, DC in 2011 and 2012. It was obtained from the dcbikeshare data set in the dsbox R package.
We will focus on the following variables in the analysis:
Click here for the full list of variables and definitions.
Let’s complete Exercises 2-6 together
Data prep
Exercise 2: Recode season as a factor with names instead of numbers (livecode)
Remember:
Think of |> as “and then”
mutate creates new columns and changes (mutates) existing columns
R calls categorical data “factors”
bikeshare <- read_csv("../data/dcbikeshare.csv") |>
  mutate(
    season = case_when(
      season == 1 ~ "winter",
      season == 2 ~ "spring",
      season == 3 ~ "summer",
      season == 4 ~ "fall"
    ),
    season = factor(season)
  )
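As a quick sanity check (not part of the original exercises), you can tabulate the recoded variable to confirm that each season label shows up as expected:

```r
# count the number of days in each season to verify the recoding worked
bikeshare |>
  count(season)
```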
Exploratory data analysis (Exercise 3)
gf_point(count ~ temp_orig | season, data = bikeshare) |>
  gf_labs(
    x = "Temperature (Celsius)",
    y = "Daily bike rentals"
  )
More data prep
Exercise 5: Filter your data for the season with the strongest relationship and give the resulting data set a new name.
winter <- bikeshare |>
  filter(season == "winter")
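If the strongest season isn't obvious from the plots, one possible approach (a sketch, not necessarily what the exercises intend) is to compare the correlation between temperature and rentals within each season:

```r
# correlation between temperature and daily rentals, by season
bikeshare |>
  group_by(season) |>
  summarize(correlation = cor(temp_orig, count))
```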
Rentals vs Temperature
Goal: Fit a line to describe the relationship between the temperature and the number of rentals in winter.
Why fit a line?
We fit a line to accomplish one or both of the following:
Prediction
How many rentals are expected when it’s 10 degrees out?
Inference
Is temperature a useful predictor of the number of rentals? By how much is the number of rentals expected to change for each degree Celsius?
Population vs. Sample
Population: The set of items or events that you're interested in and are hoping (and able) to generalize the results of your analysis to.
Sample: The set of items that you have data for.
Representative Sample: A sample that looks like a small version of your population.
Goal: Build a model from your sample which generalizes to your population.
Terminology
Response, Y: variable describing the outcome of interest
Predictor, X: variable we use to help understand the variability in the response
Regression model
Regression model: a function that describes the relationship between a quantitative response, Y, and the predictor, X (or many predictors).
$$Y = \text{Model} + \text{Error} = f(X) + \epsilon = \mu_{Y|X} + \epsilon$$
$\mu_{Y|X}$ is the mean value of $Y$ given a particular value of $X$.
Simple linear regression (SLR)
SLR: Statistical model
Simple linear regression: model to describe the relationship between Y and X where:
Y is a quantitative/numerical response
X is a single quantitative predictor
$$Y = \beta_0 + \beta_1 X + \epsilon$$
$\beta_1$: True slope of the relationship between $X$ and $Y$
$\beta_0$: True intercept of the relationship between $X$ and $Y$
$\epsilon$: Error
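To build intuition for what this statistical model says, here is a small simulation sketch (not part of the exercises; the parameter values and object names are made up purely for illustration):

```r
# simulate data from Y = beta_0 + beta_1 * X + epsilon
set.seed(123)
n       <- 100
beta_0  <- 2                              # true intercept (made up)
beta_1  <- 3                              # true slope (made up)
x       <- runif(n, min = 0, max = 10)    # predictor values
epsilon <- rnorm(n, mean = 0, sd = 2)     # error term

sim_data <- tibble(x = x, y = beta_0 + beta_1 * x + epsilon)

# scatterplot of the simulated data
gf_point(y ~ x, data = sim_data)
```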
SLR: Regression equation
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$
$\hat{\beta}_1$: Estimated slope of the relationship between $X$ and $Y$
$\hat{\beta}_0$: Estimated intercept of the relationship between $X$ and $Y$
$\hat{Y}$: Predicted value of $Y$ for a given $X$
No error term!
Choosing values for $\hat{\beta}_1$ and $\hat{\beta}_0$
Residuals
$$\text{residual} = \text{observed} - \text{predicted} = y_i - \hat{y}_i$$
Least squares line
Residual for the $i$th observation:
$$e_i = \text{observed} - \text{predicted} = y_i - \hat{y}_i$$
Sum of squared residuals:
$$e_1^2 + e_2^2 + \dots + e_n^2$$
Least squares line is the one that minimizes the sum of squared residuals
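As a concrete illustration (a sketch assuming the `winter` data frame created earlier; the function name `ssr` is hypothetical), you can compute the sum of squared residuals for any candidate line and see that values near the least squares estimates make it small:

```r
# sum of squared residuals for a candidate line with a given intercept and slope
ssr <- function(intercept, slope, data) {
  predicted <- intercept + slope * data$temp_orig
  residuals <- data$count - predicted
  sum(residuals^2)
}

# a made-up candidate line vs. values near the least squares estimates
ssr(intercept = 0, slope = 200, data = winter)
ssr(intercept = -111, slope = 222.4, data = winter)  # noticeably smaller
```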
Slope and intercept
Properties of least squares regression
Passes through the center of mass point, the coordinates corresponding to the average $X$ and average $Y$: $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
Slope has the same sign as the correlation coefficient: $\hat{\beta}_1 = r \dfrac{s_Y}{s_X}$
$r$: correlation coefficient
$s_Y$, $s_X$: sample standard deviations of $Y$ and $X$
Sum of the residuals is zero: $\sum_{i=1}^{n} e_i \approx 0$
Intuition: Residuals are “balanced”
The residuals and X values are uncorrelated
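These properties can be checked numerically. A small sketch, assuming the `winter` data frame from earlier (the model fit here mirrors the one fit later in the exercises):

```r
# fit the least squares line, then check the properties numerically
fit <- lm(count ~ temp_orig, data = winter)

sum(residuals(fit))                                       # approximately zero
cor(residuals(fit), winter$temp_orig)                     # approximately zero
predict(fit, tibble(temp_orig = mean(winter$temp_orig)))  # approximately mean(winter$count)
```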
Estimating the slope
$$\hat{\beta}_1 = r\dfrac{s_Y}{s_X}$$
$s_X = 4.2121 \qquad s_Y = 1399.942 \qquad r = 0.6692$
$$\hat{\beta}_1 = 0.6692 \times \frac{1399.942}{4.2121} = 222.417$$
Click here for details on deriving the equations for the slope and intercept (straightforward if you know multivariable calculus).
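For reference, here is a sketch of how these summary statistics and the slope estimate could be computed directly in R (assuming the `winter` data frame; `beta_1_hat` is a hypothetical name):

```r
# compute the pieces of the slope formula from the winter data
s_x <- sd(winter$temp_orig)                  # sample SD of temperature
s_y <- sd(winter$count)                      # sample SD of daily rentals
r   <- cor(winter$temp_orig, winter$count)   # correlation coefficient

beta_1_hat <- r * s_y / s_x                  # estimated slope
beta_1_hat
```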
Estimating the intercept
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
$\bar{x} = 12.2076 \qquad \bar{y} = 2604.133 \qquad \hat{\beta}_1 = 222.4167$
$$\hat{\beta}_0 = 2604.133 - 222.4167 \times 12.2076 = -111.0411$$
Click here for details on deriving the equations for slope and intercept.
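Continuing the sketch above (reusing the hypothetical `beta_1_hat` object), the intercept estimate follows from the sample means:

```r
# compute the sample means and combine with the slope estimate from above
x_bar <- mean(winter$temp_orig)   # mean temperature
y_bar <- mean(winter$count)       # mean daily rentals

beta_0_hat <- y_bar - beta_1_hat * x_bar   # estimated intercept
beta_0_hat
```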
Interpretation
Slope: For each additional unit of $X$, we expect $Y$ to change by $\hat{\beta}_1$ units, on average.
Intercept: If $X$ were 0, we would predict $Y$ to be $\hat{\beta}_0$.
Does it make sense to interpret the intercept?
✅ The intercept is meaningful in the context of the data if
the predictor can feasibly take values equal to or near zero, or
there are values near zero in the observed data.
🛑 Otherwise, the intercept may not be meaningful!
Estimating the regression line in R
Let’s complete Exercises 7-11
Fit model & estimate parameters
winter_fit <- lm(count ~ temp_orig, data = winter)
winter_fit
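Since the broom and knitr packages are loaded in the setup chunk, one option (a sketch, not necessarily the workflow used in the exercises) is to summarize the fit as a tidy data frame or a formatted table:

```r
# summarize the fitted model as a tidy data frame of estimates
tidy(winter_fit)

# optionally, format the estimates as a table for a report
tidy(winter_fit) |>
  kable(digits = 3)
```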
# create a data frame for a new temperature
new_day <- tibble(temp_orig = 15)

# predict the outcome for a new day
predict(winter_fit, new_day)
1
3225.195
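As a sanity check (a sketch using the estimated coefficients), the same prediction can be computed by hand from the regression equation, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 15 \approx -111.04 + 222.42 \times 15 \approx 3225$, which matches the `predict()` output up to rounding:

```r
# predict daily rentals at 15 degrees Celsius directly from the coefficients
coef(winter_fit)[1] + coef(winter_fit)[2] * 15
```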
Complete Exercises 12-13.
Recap
Used simple linear regression to describe the relationship between a quantitative predictor and quantitative response variable.
Used the least squares method to estimate the slope and intercept.
Interpreted the slope and intercept.
Slope: For every one unit increase in $x$, we expect $y$ to change by $\hat{\beta}_1$ units, on average.
Intercept: If $x$ is 0, then we expect $y$ to be $\hat{\beta}_0$ units.
Predicted the response given a value of the predictor variable.
Used lm and the broom package to fit and summarize regression models in R.