December 2, 2021

- The general linear model is just the same basic linear regression model you know and love.

\[y_i = \alpha + \beta x_i + \epsilon_i\]

- But this way of writing regression equations hides a lot of assumptions

- To see what weâ€™re hiding, letâ€™s write this same model a different way

\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]

\(\sim\) means “distributed”

So, \(y\) is normally distributed with a mean of \(\mu\) and a constant variance of \(\sigma^2\)

We want to estimate the

*average*outcome, which is why we “model” the meanTo model the mean of our outcome, we use some linear equation: \(\mu = \alpha + \beta * x\)

So,

\[ \begin{aligned} y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\ \mu_i &= \alpha + \beta \times x_i \end{aligned} \]

is the more complete way to express

\[y = \alpha + \beta x_i + \epsilon_i\]

because it tells us what likelihood and link (more on this later) functions we are using

Likelihood functions are essentially probability models of our data

They describe the data generating process

Ultimately the goal of statistics is to build a model that can make accurate predictions and describe data

We do this by choosing likelihood functions that describe observed data

As an example, weâ€™re going to use data on flights in the US:

## # A tibble: 6 Ã— 23 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 10 31 1824 1829 -5 1956 2011 ## 2 2013 3 11 803 723 40 1109 1020 ## 3 2013 8 21 2141 2100 41 12 2336 ## 4 2013 1 7 1731 1727 4 2117 2049 ## 5 2013 5 5 1210 1146 24 1431 1404 ## 6 2013 3 30 1921 1930 -9 2032 2113 ## # â€¦ with 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>, flight_delayed <chr>, ## # delayed <dbl>, distance_c <dbl>, distance_2sd <dbl>

Letâ€™s try to see if the distance of a flight can predict whether the flight is delayed or not

What probability model would describe flight delays? The normal distribution?

The normal distribution doesnâ€™t sound right to me

The normal distribution describes a data generating process that produces continuous values from \(-\infty\) to \(\infty\)

Letâ€™s run with it anyway

normal.fit <- glm(delayed ~ distance_2sd, family = gaussian(), data = flights)

The normal distribution doesnâ€™t model the data very well

We need a likelihood function that describes a data generating process that produces discrete values of either 0 or 1

The binomial and Bernoulli distributions both describe that process

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x_i \end{aligned} \]

What is our likelihood function?

What is our link function?

Link functions are used to map the range of a linear equation to the range of the outcome variable

The logit link function restricts the range of the linear equation to fall between 0 and 1

The log link function restricts the range of the linear equation to be positive, continuous, and greater than 0

Our models always use the same linear structure (\(\mu_i = \alpha + \beta x_i\))

But, we can use any likelihood and link function that we want

With these two pieces, we can

**generalize**our linear equation to model any type of dataHence, the

**generalized linear model**It always has a linear equation, but it doesnâ€™t always have a normal distribution!

- Letâ€™s model flight delays with a more realistic likelihood function - the binomial

\[ \begin{aligned} y_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \alpha + \beta \times x_i \end{aligned} \]

binomial.fit <- glm(delayed ~ distance_2sd, family = binomial(link = "logit"), data = flights)

All you have to do is “undo” the link function

Logit link function: \(\ln \left( \frac{P}{1 - P} \right)\)

Inverse logit link function: \(\frac{1}{1 + \exp^{(P)}}\)

Inverse logit link function in R:

`plogis()`

## (Intercept) distance_2sd ## -0.3022855 0.3921170

plogis(coef(binomial.fit))

## (Intercept) distance_2sd ## 0.4249989 0.5967922

A one unit increase in `distance_2sd`

increases the relative probability of having a delayed flight by about 60%

GLMs can be used to model any type of data (just about)

Pick a realistic likelihood function, then pick a link function

We want to model yard breaks as a function of the type of wool and the tension applied.

What kind of a distribution do we need?

Continuous or discrete?

Positive only, negative only, or all numbers?

Poisson is discrete and positive only

What kind of link function?

\[ \begin{aligned} y_i &\sim \text{Poisson}(\lambda_i) \\ \text{log}(\lambda_i) &= \alpha + \beta \times x_i \end{aligned} \]

poisson.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"), data = warpbreaks)

What if we just used a normal likelihood function?