December 2, 2021
\[y = \alpha + \beta x_i + \epsilon\]
\[ \begin{aligned} y &\sim \text{Normal}(\mu_i, 0) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]
\(\sim\) means “distributed”
So, \(y\) is normally distributed with a mean of \(\mu\) and a standard deviation of 0
We want to estimate the average outcome, which is why we “model” the mean
To model the mean of our outcome, we use some linear equation: \(\mu = \alpha + \beta * x\)
So,
\[ \begin{aligned} y &\sim \text{Normal}(\mu_i, 0) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]
is the more complete way to express
\[y = \alpha + \beta x_i + \epsilon\]
because it tells us what likelihood and link (more on this later) functions we are using
Likelihood functions are essentially probability models of our data
They describe the data generating process
Ultimately the goal of statistics is to build a model that can make accurate predictions and describe data
We do this by choosing likelihood functions that describe observed data
As an example, we’re going to use data on flights in the US:
## # A tibble: 6 × 23 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 9 5 621 625 -4 745 800 ## 2 2013 9 11 1832 1830 2 2200 2122 ## 3 2013 7 31 628 630 -2 755 755 ## 4 2013 10 4 1200 1200 0 1309 1315 ## 5 2013 3 7 1022 1001 21 1149 1129 ## 6 2013 12 8 1758 1800 -2 1942 1919 ## # … with 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>, flight_delayed <chr>, ## # delayed <dbl>, distance_c <dbl>, distance_2sd <dbl>
Let’s try to see if the distance of a flight can predict whether the flight is delayed or not
What probability model would describe flight delays? The normal distribution?
The normal distribution doesn’t sound right to me
The normal distribution describes a data generating process that produces continuous values from \(-\infty\) to \(\infty\)
Let’s run with it anyway
normal.fit <- glm(delayed ~ distance_2sd, family = gaussian(), data = flights)
The normal distribution doesn’t model the data very well
We need a likelihood function that describes a data generating process that produces discrete values of either 0 or 1
The binomial and Bernoulli distributions both describe that process
\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x \end{aligned} \]
What is our likelihood function?
What is our link function?
Link functions are used to map the range of a linear equation to the range of the outcome variable
The logit link function restricts the range of the linear equation to be either 0 or 1
The log link function restricts the range of the linear equation to be positive, continuous, and greater than 0
Our models always use the same linear structure (\(\mu_i = \alpha + \beta x_i\))
But, we can use any likelihood and link function that we want
With these two pieces, we can generalize our linear equation to model any type of data
Hence, the generalized linear model
It always has a linear equation, but it doesn’t always have a normal distribution!
\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x \end{aligned} \]
binomial.fit <- glm(delayed ~ distance_2sd, family = binomial(link = "logit"), data = flights)
All you have to do is “undo” the link function
Logit link function: \(\ln \left( \frac{P}{1 - P} \right)\)
Inverse logit link function: \(\frac{1}{1 + \exp^{(P)}}\)
Inverse logit link function in R: plogis()
## (Intercept) distance_2sd ## -0.3529188 0.2009372
plogis(coef(binomial.fit))
## (Intercept) distance_2sd ## 0.4126748 0.5500660
A one unit increase in distance_2sd
increases the relative probability of having a delayed flight by about 55%
GLMs can be used to model any type of data (just about)
Pick a realistic likelihood function, then pick a link function
We want to model yard breaks as a function of the type of wool and the tension applied.
What kind of a distribution do we need?
Continuous or discrete?
Positive only, negative only, or all numbers?
Poisson is discrete and positive only
What kind of link function?
\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \\ \text{log}(\lambda_i) &= \alpha + \beta \times x \end{aligned} \]
poisson.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"), data = warpbreaks)
What if we just used a normal likelihood function?
Now with a Poisson likelihood:
How would we interpret the coefficients from a Poisson regression?
## (Intercept) woolB tensionM tensionH ## 40.1235380 0.8138425 0.7251908 0.5954198
The coefficient was negative, so it decreases breaks
Compared to wool A, the number of breaks for woolB
is predicted to be \((1-0.813) * 100\) 18.7% lower.