Intro to Generalized Linear Models

December 2, 2021

The General Linear Model

Forget generalizing it, what is the general linear model?

The general linear model is just the same basic linear regression model you know and love.

\[y = \alpha + \beta x_i + \epsilon\]

But this way of writing regression equations hides a lot of assumptions

Revealing the assumptions of your model

To see what we’re hiding, let’s write this same model a different way

\[ \begin{aligned} y &\sim \text{Normal}(\mu_i, 0) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]

\(\sim\) means “distributed”
So, \(y\) is normally distributed with a mean of \(\mu\) and a standard deviation of 0
We want to estimate the average outcome, which is why we “model” the mean
To model the mean of our outcome, we use some linear equation: \(\mu = \alpha + \beta * x\)

Revealing the assumptions of your model

So,

\[ \begin{aligned} y &\sim \text{Normal}(\mu_i, 0) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]

is the more complete way to express

\[y = \alpha + \beta x_i + \epsilon\]

because it tells us what likelihood and link (more on this later) functions we are using

What are likelihood functions, again?

Likelihood functions are essentially probability models of our data
They describe the data generating process
Ultimately the goal of statistics is to build a model that can make accurate predictions and describe data
We do this by choosing likelihood functions that describe observed data

Generalizing the General Linear Model

What if the normal distribution doesn’t describe our data?

As an example, we’re going to use data on flights in the US:

## # A tibble: 6 × 23
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9     5      621            625        -4      745            800
## 2  2013     9    11     1832           1830         2     2200           2122
## 3  2013     7    31      628            630        -2      755            755
## 4  2013    10     4     1200           1200         0     1309           1315
## 5  2013     3     7     1022           1001        21     1149           1129
## 6  2013    12     8     1758           1800        -2     1942           1919
## # … with 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, flight_delayed <chr>,
## #   delayed <dbl>, distance_c <dbl>, distance_2sd <dbl>

What if the normal distribution doesn’t describe our data?

Let’s try to see if the distance of a flight can predict whether the flight is delayed or not
What probability model would describe flight delays? The normal distribution?

Modeling Flight Delays

The normal distribution doesn’t sound right to me
The normal distribution describes a data generating process that produces continuous values from \(-\infty\) to \(\infty\)
Let’s run with it anyway

normal.fit <- glm(delayed ~ distance_2sd,
                  family = gaussian(),
                  data = flights)

Modeling Flight Delays

The normal distribution doesn’t model the data very well
We need a likelihood function that describes a data generating process that produces discrete values of either 0 or 1
The binomial and Bernoulli distributions both describe that process

Modeling Flight Delays

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x \end{aligned} \]

What is our likelihood function?
What is our link function?

Wait, what is a link function?

Link functions are used to map the range of a linear equation to the range of the outcome variable
The logit link function restricts the range of the linear equation to be either 0 or 1
The log link function restricts the range of the linear equation to be positive, continuous, and greater than 0

Connecting the Dots

Our models always use the same linear structure (\(\mu_i = \alpha + \beta x_i\))
But, we can use any likelihood and link function that we want
With these two pieces, we can generalize our linear equation to model any type of data
Hence, the generalized linear model
It always has a linear equation, but it doesn’t always have a normal distribution!

Modeling Flight Delays

Let’s model flight delays with a more realistic likelihood function - the binomial

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x \end{aligned} \]

binomial.fit <- glm(delayed ~ distance_2sd,
                    family = binomial(link = "logit"),
                    data = flights)

Modeling Flight Delays

How do we intrepret models with different likelihood and link functions?

All you have to do is “undo” the link function
Logit link function: \(\ln \left( \frac{P}{1 - P} \right)\)
Inverse logit link function: \(\frac{1}{1 + \exp^{(P)}}\)
Inverse logit link function in R: plogis()

How do we intrepret models with different likelihood and link functions?

##  (Intercept) distance_2sd 
##   -0.3529188    0.2009372

plogis(coef(binomial.fit))

##  (Intercept) distance_2sd 
##    0.4126748    0.5500660

A one unit increase in distance_2sd increases the relative probability of having a delayed flight by about 55%

The Flexibility of GLMs

GLMs can be used to model any type of data (just about)
Pick a realistic likelihood function, then pick a link function

Modeling Test

We want to model yard breaks as a function of the type of wool and the tension applied.

Modeling Test

What kind of a distribution do we need?

Continuous or discrete?
Positive only, negative only, or all numbers?

Modeling Test

Poisson is discrete and positive only
What kind of link function?

\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \\ \text{log}(\lambda_i) &= \alpha + \beta \times x \end{aligned} \]

poisson.fit <- glm(breaks ~ wool + tension,
                   family = poisson(link = "log"),
                   data = warpbreaks)

Modeling Test

What if we just used a normal likelihood function?

Modeling Test

Now with a Poisson likelihood:

Modeling Test

How would we interpret the coefficients from a Poisson regression?

## (Intercept)       woolB    tensionM    tensionH 
##  40.1235380   0.8138425   0.7251908   0.5954198

The coefficient was negative, so it decreases breaks
Compared to wool A, the number of breaks for woolB is predicted to be \((1-0.813) * 100\) 18.7% lower.

The General Linear Model

Forget generalizing it, what is the general linear model?

Revealing the assumptions of your model

Revealing the assumptions of your model

What are likelihood functions, again?

Generalizing the General Linear Model

What if the normal distribution doesn’t describe our data?

What if the normal distribution doesn’t describe our data?

Modeling Flight Delays

Modeling Flight Delays

Modeling Flight Delays

Modeling Flight Delays

Modeling Flight Delays

Wait, what is a link function?

Connecting the Dots

Modeling Flight Delays

Modeling Flight Delays

How do we intrepret models with different likelihood and link functions?

How do we intrepret models with different likelihood and link functions?

The Flexibility of GLMs

Modeling Test

Modeling Test

Modeling Test

Modeling Test

Modeling Test

Modeling Test

Modeling Test

Let’s practice!