December 2, 2021
\[y_i = \alpha + \beta x_i + \epsilon_i\]
\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]
\(\sim\) means “distributed”
So, \(y\) is normally distributed with a mean of \(\mu\) and a constant variance of \(\sigma^2\)
We want to estimate the average outcome, which is why we “model” the mean
To model the mean of our outcome, we use some linear equation: \(\mu = \alpha + \beta * x\)
So,
\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]
is the more complete way to express
\[y = \alpha + \beta x_i + \epsilon_i\]
because it tells us what likelihood and link (more on this later) functions we are using
Likelihood functions are essentially probability models of our data
They describe the data generating process
Ultimately the goal of statistics is to build a model that can make accurate predictions and describe data
We do this by choosing likelihood functions that describe observed data
As an example, we’re going to use data on flights in the US:
## # A tibble: 6 × 23 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 10 31 1824 1829 -5 1956 2011 ## 2 2013 3 11 803 723 40 1109 1020 ## 3 2013 8 21 2141 2100 41 12 2336 ## 4 2013 1 7 1731 1727 4 2117 2049 ## 5 2013 5 5 1210 1146 24 1431 1404 ## 6 2013 3 30 1921 1930 -9 2032 2113 ## # … with 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>, flight_delayed <chr>, ## # delayed <dbl>, distance_c <dbl>, distance_2sd <dbl>
Let’s try to see if the distance of a flight can predict whether the flight is delayed or not
What probability model would describe flight delays? The normal distribution?
The normal distribution doesn’t sound right to me
The normal distribution describes a data generating process that produces continuous values from \(-\infty\) to \(\infty\)
Let’s run with it anyway
normal.fit <- glm(delayed ~ distance_2sd, family = gaussian(), data = flights)
The normal distribution doesn’t model the data very well
We need a likelihood function that describes a data generating process that produces discrete values of either 0 or 1
The binomial and Bernoulli distributions both describe that process
\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x_i \end{aligned} \]
What is our likelihood function?
What is our link function?
Link functions are used to map the range of a linear equation to the range of the outcome variable
The logit link function restricts the range of the linear equation to fall between 0 and 1
The log link function restricts the range of the linear equation to be positive, continuous, and greater than 0
Our models always use the same linear structure (\(\mu_i = \alpha + \beta x_i\))
But, we can use any likelihood and link function that we want
With these two pieces, we can generalize our linear equation to model any type of data
Hence, the generalized linear model
It always has a linear equation, but it doesn’t always have a normal distribution!
\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x_i \end{aligned} \]
binomial.fit <- glm(delayed ~ distance_2sd, family = binomial(link = "logit"), data = flights)
All you have to do is “undo” the link function
Logit link function: \(\ln \left( \frac{P}{1 - P} \right)\)
Inverse logit link function: \(\frac{1}{1 + \exp^{(P)}}\)
Inverse logit link function in R: plogis()
## (Intercept) distance_2sd ## -0.3022855 0.3921170
plogis(coef(binomial.fit))
## (Intercept) distance_2sd ## 0.4249989 0.5967922
A one unit increase in distance_2sd
increases the relative probability of having a delayed flight by about 60%
GLMs can be used to model any type of data (just about)
Pick a realistic likelihood function, then pick a link function
We want to model yard breaks as a function of the type of wool and the tension applied.
What kind of a distribution do we need?
Continuous or discrete?
Positive only, negative only, or all numbers?
Poisson is discrete and positive only
What kind of link function?
\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \\ \text{log}(\lambda_i) &= \alpha + \beta \times x_i \end{aligned} \]
poisson.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"), data = warpbreaks)
What if we just used a normal likelihood function?
Now with a Poisson likelihood:
How would we interpret the coefficients from a Poisson regression?
## (Intercept) woolB tensionM tensionH ## 40.1235380 0.8138425 0.7251908 0.5954198
The coefficient was negative, so it decreases breaks
Compared to wool A, the number of breaks for woolB
is predicted to be \((1-0.813) * 100\) 18.7% lower.