December 2, 2021

The General Linear Model

Forget generalizing it, what is the general linear model?

  • The general linear model is just the same basic linear regression model you know and love.

\[y_i = \alpha + \beta x_i + \epsilon_i\]

  • But this way of writing regression equations hides a lot of assumptions

Revealing the assumptions of your model

  • To see what we’re hiding, let’s write this same model a different way

\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]

  • \(\sim\) means “distributed”

  • So, \(y\) is normally distributed with a mean of \(\mu\) and a constant variance of \(\sigma^2\)

  • We want to estimate the average outcome, which is why we “model” the mean

  • To model the mean of our outcome, we use some linear equation: \(\mu = \alpha + \beta * x\)

Revealing the assumptions of your model

So,

\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]

is the more complete way to express

\[y = \alpha + \beta x_i + \epsilon_i\]

because it tells us what likelihood and link (more on this later) functions we are using

What are likelihood functions, again?

  • Likelihood functions are essentially probability models of our data

  • They describe the data generating process

  • Ultimately the goal of statistics is to build a model that can make accurate predictions and describe data

  • We do this by choosing likelihood functions that describe observed data

Generalizing the General Linear Model

What if the normal distribution doesn’t describe our data?

As an example, we’re going to use data on flights in the US:

## # A tibble: 6 Ă— 23
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013    10    31     1824           1829        -5     1956           2011
## 2  2013     3    11      803            723        40     1109           1020
## 3  2013     8    21     2141           2100        41       12           2336
## 4  2013     1     7     1731           1727         4     2117           2049
## 5  2013     5     5     1210           1146        24     1431           1404
## 6  2013     3    30     1921           1930        -9     2032           2113
## # … with 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, flight_delayed <chr>,
## #   delayed <dbl>, distance_c <dbl>, distance_2sd <dbl>

What if the normal distribution doesn’t describe our data?

  • Let’s try to see if the distance of a flight can predict whether the flight is delayed or not

  • What probability model would describe flight delays? The normal distribution?

Modeling Flight Delays

Modeling Flight Delays

  • The normal distribution doesn’t sound right to me

  • The normal distribution describes a data generating process that produces continuous values from \(-\infty\) to \(\infty\)

  • Let’s run with it anyway

normal.fit <- glm(delayed ~ distance_2sd,
                  family = gaussian(),
                  data = flights)

Modeling Flight Delays

Modeling Flight Delays

  • The normal distribution doesn’t model the data very well

  • We need a likelihood function that describes a data generating process that produces discrete values of either 0 or 1

  • The binomial and Bernoulli distributions both describe that process

Modeling Flight Delays

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x_i \end{aligned} \]

  • What is our likelihood function?

  • What is our link function?

Wait, what is a link function?

Connecting the Dots

  • Our models always use the same linear structure (\(\mu_i = \alpha + \beta x_i\))

  • But, we can use any likelihood and link function that we want

  • With these two pieces, we can generalize our linear equation to model any type of data

  • Hence, the generalized linear model

  • It always has a linear equation, but it doesn’t always have a normal distribution!

Modeling Flight Delays

  • Let’s model flight delays with a more realistic likelihood function - the binomial

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x_i \end{aligned} \]

binomial.fit <- glm(delayed ~ distance_2sd,
                    family = binomial(link = "logit"),
                    data = flights)

Modeling Flight Delays

How do we intrepret models with different likelihood and link functions?

How do we intrepret models with different likelihood and link functions?

The Flexibility of GLMs

  • GLMs can be used to model any type of data (just about)

  • Pick a realistic likelihood function, then pick a link function

Modeling Test

We want to model yard breaks as a function of the type of wool and the tension applied.

Modeling Test

Modeling Test

What kind of a distribution do we need?

  • Continuous or discrete?

  • Positive only, negative only, or all numbers?

Modeling Test

  • Poisson is discrete and positive only

  • What kind of link function?

\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \\ \text{log}(\lambda_i) &= \alpha + \beta \times x_i \end{aligned} \]

poisson.fit <- glm(breaks ~ wool + tension,
                   family = poisson(link = "log"),
                   data = warpbreaks)

Modeling Test

What if we just used a normal likelihood function?

Modeling Test

Now with a Poisson likelihood:

Modeling Test

How would we interpret the coefficients from a Poisson regression?

## (Intercept)       woolB    tensionM    tensionH 
##  40.1235380   0.8138425   0.7251908   0.5954198
  • The coefficient was negative, so it decreases breaks

  • Compared to wool A, the number of breaks for woolB is predicted to be \((1-0.813) * 100\) 18.7% lower.

Let’s practice!