# Linear Models

by Chee Yee Lim

Posted on 2021-04-30

Collection of notes on model type - focusing on all types of linear models (statistical point of view).

## Linear Model

### Overview

• The general equation for a linear regression:
• $$y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \epsilon$$, where $$\epsilon$$ is the difference between predicted and observed values, assumed to be Gaussian distributed.
• The confidence intervals on weights can be estimated by bootstrapping, i.e. repeatedly refitting the model on resampled data.
• Linear regression can also be written in the form of a matrix.
• $$\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$, where $$\boldsymbol{X} = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{k,1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1,N} & \cdots & x_{k,N} \end{pmatrix}$$, $$N$$ is the number of observations, $$k$$ is the number of variables, and the column of $$1$$s represents the intercept.
• $$\boldsymbol{\epsilon}$$ has mean $$\boldsymbol{0}$$ and covariance matrix $$\sigma^2 \boldsymbol{I}$$
• Powerful for their relative simplicity.
• Computationally cheap.
• By default, only model a linear relation between input and response variables.
• Require input processing.
• If there are many features to the model, it is possible to introduce sparsity into the model by using the following methods.
• Model regularisation using Lasso / L1-norm
• Stepwise model fitting
• Model fitting on dimensionally reduced data
• Multicollinearity will cause:
1. Model instability
• Small changes in data lead to large changes in model (i.e. coefficients), making model interpretation difficult.
• Magnitude of model coefficients may not make sense.
2. Numerical instability in model fitting
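• Under perfect multicollinearity, the inverse of $$X^T X$$ does not exist, due to linear dependency between columns in the matrix. Hence standard OLS cannot be used. With strong but imperfect multicollinearity, the matrix is near-singular and the inversion is numerically unstable.
• An approximate inverse may or may not be constructed, depending on the solver algorithm used.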
3. However, if the goal is not to get interpretability and the fitting algorithm is not affected, then multicollinearity is not a concern.
• From a probabilistic point of view, fitting a linear regression by OLS is equivalent to maximum likelihood estimation (MLE) under a Gaussian likelihood.
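
The bootstrap idea mentioned in the list above can be sketched as follows. This is only a minimal illustration, assuming a single predictor, synthetic data and a percentile interval; the helper names `fit_simple_ols` and `bootstrap_slope_ci` are hypothetical and do not come from any library.

```python
import random


def fit_simple_ols(xs, ys):
    """Closed-form OLS for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def bootstrap_slope_ci(xs, ys, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the slope b1."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Resample observations (rows) with replacement, then refit.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # slope is undefined if all x are identical
            continue
        slopes.append(fit_simple_ols(bx, by)[1])
    slopes.sort()
    lo = slopes[int((1 - level) / 2 * len(slopes))]
    hi = slopes[int((1 + level) / 2 * len(slopes)) - 1]
    return lo, hi


# Synthetic data (assumed for illustration): y = 1 + 2x + Gaussian noise.
data_rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0, 1) for x in xs]

b0, b1 = fit_simple_ols(xs, ys)
lo, hi = bootstrap_slope_ci(xs, ys)
print(f"slope = {b1:.3f}, 95% bootstrap interval = ({lo:.3f}, {hi:.3f})")
```

Resampling whole observations (rather than residuals) is one of several bootstrap variants; it is used here because it needs no extra assumptions about the error terms.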

### Assumptions

• Linearity
• Linear relationship between predictors and target variables.
• This is its greatest strength and greatest weakness.
• Normality
• Error terms are normally distributed.
• Homoscedasticity
• Constant variance of error terms (across all features).
• Independence
• Each data point/error term must be independent.
• Absence of multicollinearity - strongly correlated features affect weight estimations.

### Model Fitting

• A linear regression is typically fitted by minimising the residual sum of squares (RSS).
• $$RSS(\beta) = \sum_{i=1}^N (y_i - \hat{y}_i)^2 = \sum_{i=1}^N (y_i - \beta^T x_i)^2$$
• This is known as Ordinary Least Squares (OLS). This has a closed form solution.
• Least squares estimation can be written in matrix formulation.
• $$\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$$
• This is sometimes known as the "normal equation".
• The estimated coefficients require the inversion of the matrix $$\boldsymbol{X}^T \boldsymbol{X}$$. If $$\boldsymbol{X}$$ is not of full column rank, then the matrix $$\boldsymbol{X}^T \boldsymbol{X}$$ is singular and the model cannot be estimated this way. Another way to solve for the coefficients is then needed.
• The residual variance is estimated using $$\hat{\sigma}_{e}^2 = \frac{1}{N-k-1} (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})^T (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})$$, where $$N$$ is the number of observations and $$k$$ is the number of variables.
• It can also be fitted by gradient descent.
• First define the loss function (mean squared error), where $$\hat{y}_i = m x_i + c$$ is the predicted value.
• $$E = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
• $$E = \frac{1}{n} \sum_{i=1}^n (y_i - (m x_i + c))^2$$
• Calculate the partial derivative of the loss function wrt $$m$$.
• $$D_m = \frac{1}{n} \sum_{i=1}^n 2 (y_i - (m x_i + c))(- x_i)$$
• $$D_m = \frac{-2}{n} \sum_{i=1}^n x_i (y_i - \hat{y}_i)$$
• Calculate the partial derivative of the loss function wrt $$c$$.
• $$D_c = \frac{-2}{n} \sum_{i=1}^n (y_i - \hat{y}_i)$$
• Update $$m$$ and $$c$$ by subtracting $$D_m$$ and $$D_c$$, each multiplied by the learning rate $$L$$.
• $$m = m - L \times D_m$$
• $$c = c - L \times D_c$$
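
Both fitting routes above can be compared on a toy problem. The sketch below assumes a single predictor (so the normal equation reduces to the familiar covariance / variance formula) and uses made-up data; the learning rate and iteration count are arbitrary choices that happen to converge for this data set, not recommended defaults.

```python
def closed_form(xs, ys):
    """OLS solution of the normal equation for y = m * x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def gradient_descent(xs, ys, learning_rate=0.05, n_iter=2000):
    """Minimise the mean squared error E using the D_m and D_c updates."""
    n = len(xs)
    m, c = 0.0, 0.0
    for _ in range(n_iter):
        residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
        d_m = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
        d_c = (-2 / n) * sum(residuals)
        m -= learning_rate * d_m
        c -= learning_rate * d_c
    return m, c


# Toy data (assumed for illustration), roughly y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

m_ols, c_ols = closed_form(xs, ys)
m_gd, c_gd = gradient_descent(xs, ys)
print(m_ols, c_ols)  # closed-form slope and intercept
print(m_gd, c_gd)    # gradient-descent estimates, expected to agree closely
```

If the learning rate is set too high, the updates overshoot and the estimates diverge; if it is too low, many more iterations are needed.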

## Generalised Linear Model

### Overview

• The fundamental equation of a generalised linear model is:
• $$g(E(y)) = \alpha + \beta x_1 + \gamma x_2$$, where $$g()$$ is the link function, $$E(y)$$ is the expectation of the target variable and $$\alpha + \beta x_1 + \gamma x_2$$ is the linear predictor ($$\alpha, \beta, \gamma$$ are the parameters to be estimated).
• The role of the link function is to 'link' the expectation of $$y$$ to the linear predictor.
• The link function and the distribution of the random component are chosen separately.

### Assumptions

• Linearity
• GLM does not assume a linear relationship between dependent and independent variables.
• However, it assumes a linear relationship between the link-transformed expected response, $$g(E(y))$$, and the independent variables.
• Normality
• The dependent variable need not be normally distributed.
• Independence
• Errors need to be independent, but need not be normally distributed.
• Most other common assumptions of linear models are due to the use of OLS.
• E.g. linearity, normality, homoscedasticity and measurement level.

### Model Regularisation

• Regularisation is a method for adding additional constraints or penalty to a model, with the goal of preventing overfitting and improving generalisation.
• Instead of minimising a loss function $$E(X,Y)$$, the loss function to minimise becomes $$E(X,Y) + \alpha | w |$$, where $$w$$ is the vector of model coefficients, $$| \cdot |$$ is typically L1 or L2 norm and $$\alpha$$ is a tunable free parameter, specifying the amount of regularisation (so $$\alpha = 0$$ implies an unregularised model).
• L1 norm / LASSO (sum of absolute values of betas)
• Since each non-zero coefficient adds to the penalty, it forces weak features to have zero as coefficients.
• Thus, L1 regularisation produces sparse solutions, inherently performing feature selection.
• L2 norm / ridge (the L2 norm is the square root of the sum of squared betas; the ridge penalty uses its square, i.e. the sum of squared betas)
• Since the coefficients are squared in the penalty expression, it has a different effect from L1-norm, namely it forces the coefficient values to be spread out more equally.
• The effect of this is that models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularised or L1 models).
• In the Bayesian world, L2 regularisation is equivalent to placing a zero-mean Gaussian prior on the coefficients.
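
The different behaviour of the two penalties can be seen in the simplest possible case. The sketch below assumes a single predictor with no intercept and a residual-sum-of-squares loss, so that both penalised problems have closed-form solutions: ridge minimises $$\sum (y_i - b x_i)^2 + \alpha b^2$$ and lasso minimises $$\sum (y_i - b x_i)^2 + \alpha |b|$$. The function names and the data are made up for illustration.

```python
import math


def ols_coef(xs, ys):
    """Unregularised least-squares coefficient (alpha = 0)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ridge_coef(xs, ys, alpha):
    """Minimiser of RSS + alpha * b**2: shrinks b but never to exactly zero."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + alpha)


def lasso_coef(xs, ys, alpha):
    """Minimiser of RSS + alpha * |b|: soft-thresholding of the OLS solution."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(sxy) - alpha / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# A weak feature: y barely depends on x (assumed toy data).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-0.3, 0.1, 0.0, 0.2, 0.1]

b_ols = ols_coef(xs, ys)
b_ridge = ridge_coef(xs, ys, alpha=2.0)
b_lasso = lasso_coef(xs, ys, alpha=2.0)
print(b_ols, b_ridge, b_lasso)
```

With this data the ridge coefficient is shrunk towards zero but stays non-zero, while the lasso coefficient is set to exactly zero, which is the sparsity effect described above.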

### Model Fitting

• GLMs do not use OLS (ordinary least squares) for parameter estimation. Instead, they use maximum likelihood estimation (MLE).
• If applied to linear regression with Gaussian errors, MLE gives the same estimates as OLS.
• MLE is a more general technique that can be applied to many models.
• MLE is about estimating the parameters when we know the data and we assume a distribution.
• Likelihood is a measure of fit between some observed data and the population parameters.
• A higher likelihood implies a better fit between the observed data and the parameters.
• The goal of MLE is to find the population parameters that are most likely to have generated the observed data.
• A likelihood function looks just like a probability density function.
• The only difference is that in a likelihood function, the data are taken as given and used to estimate the parameter; while in a probability density function, the parameter is taken as given and used to predict the data.
• $$L(\theta | x) = f(x | \theta)$$
• e.g. for Bernoulli distribution:
• $$P(x | p) = p^x (1 - p)^{1-x}$$
• $$L(p | x,n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1-x_i}$$
• As likelihood gives extremely small numbers that create rounding problems, we take the logarithm of the likelihood function, i.e. the log-likelihood.
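• For discrete distributions, likelihoods are always between 0 and 1, so log-likelihoods are never positive. (For continuous distributions, densities can exceed 1, so this does not hold in general.)
• E.g. for Bernoulli distribution:
• $$l(p | x,n) = \sum_{i=1}^n \left[ x_i \log(p) + (1 - x_i) \log(1-p) \right]$$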
• To find the parameter that corresponds to the maximum log-likelihood, we look at the first derivative of the log-likelihood (corresponding to the gradient).
• Set the derivative to 0 and solve for the unknown parameter.
• The resulting value is the maximum likelihood estimate.
• This step is usually done iteratively, by first choosing arbitrary starting parameter values, then updating them using the vector of partial derivatives of the log-likelihood function.
• Then we look at the second derivative of log-likelihood (corresponding to valley/peak).
• The value being negative will confirm the function being concave, i.e. we have reached a maximum.
• Otherwise the value being positive will suggest the function being convex, i.e. we have reached a minimum.
• Second derivative of log-likelihood is also used to compute the standard errors.
• The second derivative is a measure of the curvature of a function. The more sharply curved the log-likelihood is around its peak, the more certain we are about our estimates.
• The matrix of second derivatives is called the Hessian.
• The inverse of the negative Hessian matrix, evaluated at the MLE, is the variance-covariance matrix of the estimates.
• The standard errors of MLE are the square root of the diagonal entries (i.e. variance) of this matrix.
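
The steps above can be traced for the Bernoulli example. The sketch below is a minimal illustration, assuming a small made-up sample and a Newton-Raphson update (one possible iterative scheme); it uses the first derivative to locate the maximum, the second derivative to confirm concavity, and the inverse of the negative second derivative to get the standard error.

```python
import math


def bernoulli_mle(xs, p_start=0.4, tol=1e-12, max_iter=50):
    """Newton-Raphson on the Bernoulli log-likelihood; returns (p, hessian, se)."""
    n, s = len(xs), sum(xs)
    p = p_start
    for _ in range(max_iter):
        score = s / p - (n - s) / (1 - p)                # first derivative
        hessian = -s / p ** 2 - (n - s) / (1 - p) ** 2   # second derivative
        step = score / hessian
        p -= step                                        # Newton update
        if abs(step) < tol:
            break
    hessian = -s / p ** 2 - (n - s) / (1 - p) ** 2       # curvature at the MLE
    std_err = math.sqrt(-1 / hessian)                    # sqrt of inverse negative Hessian
    return p, hessian, std_err


# Assumed toy sample: 7 successes out of 10 trials.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat, hess, se = bernoulli_mle(sample)
print(p_hat)  # agrees with the closed-form MLE, sum(x) / n
print(hess)   # negative, so the log-likelihood is concave at p_hat
print(se)     # matches sqrt(p_hat * (1 - p_hat) / n)
```

For the Bernoulli case the closed-form answer $$\hat{p} = \frac{1}{n} \sum x_i$$ is available, so the iteration is only there to show the mechanics that models without a closed form rely on.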

## Logistic Regression

### Overview

• The link function of a logistic regression is a logit function.
• The response Y (random component) of logistic regression has Binomial distribution.
• The general equation of a logistic regression is:
• Given a linear equation,
• $$\hat{y} = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p$$
• Logistic regression (model output) is defined by,
• $$P(y = 1) = \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_1 + ... + \beta_p x_p))}$$
• Logistic regression is a linear model for the log odds.
• $$\log ( \frac{ P(y=1) }{ 1 - P(y=1) } ) = \log ( \frac{ P(y=1) }{ P(y=0) } ) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p$$
• $$\text{odds} = \frac{ \text{probability of event} }{ \text{probability of no event} }$$
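• The logistic function (the inverse of the logit, i.e. log odds, function) is defined by:
• $$\operatorname{logistic}(\eta) = \frac{1}{ 1 + \exp (- \eta) }$$
• The default classification threshold on the predicted probability is 0.5 (equivalently, log odds of 0), although a different threshold can be chosen.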
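
The relationship between the linear predictor, the probability and the log odds can be checked numerically. This is a small sketch with an arbitrary value of $$\eta$$; the function names are chosen for illustration.

```python
import math


def logistic(eta):
    """Map a linear predictor (log odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Map a probability to log odds; the inverse of logistic()."""
    return math.log(p / (1.0 - p))


eta = 1.5               # value of beta_0 + beta_1 * x_1 + ... (arbitrary)
p = logistic(eta)       # P(y = 1)
odds = p / (1.0 - p)    # probability of event / probability of no event
print(p, odds, logit(p))
```

Applying `logit` to the output of `logistic` recovers the original linear predictor, and a linear predictor of 0 corresponds to a probability of 0.5.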

### Model Fitting

• The model is fitted by maximum likelihood estimation (not OLS), starting from initial parameter estimates.
• The estimates are updated iteratively until the log-likelihood no longer changes significantly.
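
The fitting loop can be sketched for a single predictor. The code below assumes Newton-Raphson updates (one common choice; real libraries may use other solvers), made-up non-separable data, and a stopping rule based on the change in log-likelihood. `fit_logistic` is a hypothetical helper, not a library function.

```python
import math


def fit_logistic(xs, ys, tol=1e-12, max_iter=100):
    """Fit P(y=1) = logistic(b0 + b1 * x) by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                 for y, p in zip(ys, ps))
        if abs(ll - prev_ll) < tol:   # log-likelihood has stopped changing
            break
        prev_ll = ll
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Negative Hessian (2 x 2), with weights p * (1 - p).
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(negative Hessian) @ gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1, ll


# Assumed toy data: larger x makes y = 1 more likely, with some overlap.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1, ll = fit_logistic(xs, ys)
probs = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
print(b0, b1, ll)
```

The data must not be perfectly separable, otherwise the maximum likelihood estimates do not exist and the coefficients grow without bound.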

### Assumptions

• Does not assume a linear relationship between dependent and independent variables.
• But assumes linearity between log odds and independent variables.
• Error terms should be independent, but need not be normally distributed.
• Homoscedasticity (i.e. constant variance of error terms) is not assumed.
• Observations must be independent.
• Should not have multicollinearity, i.e. strong correlations among independent variables.

## Poisson Regression

### Overview

• Poisson regression is a type of GLM used for modelling counts.
• In Poisson regression, the response Y (random component) has a Poisson distribution, that is $$y_i \sim \text{Poisson}( \mu_i )$$ for $$i = 1, \ldots, N$$.
• In most cases, the link function for Poisson regression is the natural log link, i.e. $$\log(\mu) = \beta_0 + \beta_1 x_1$$.
• Sometimes, the identity link is used instead: $$\mu = \beta_0 + \beta_1 x_1$$. Note that this can give $$\mu < 0$$.

### Model Fitting

• It is fitted by maximum likelihood estimation, i.e. finding the parameter values that maximise the log-likelihood.
• There is no closed-form solution, so the MLEs are obtained using iterative algorithms such as Newton-Raphson or iteratively re-weighted least squares (IRWLS).
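
A Newton-Raphson fit for a Poisson regression with a log link might look like the sketch below. It assumes a single predictor, made-up count data, and a start at the intercept-only solution; `fit_poisson` is a hypothetical helper name.

```python
import math


def fit_poisson(xs, ys, tol=1e-10, max_iter=100):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(sum(ys) / len(ys))   # start at the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood sum(y * log(mu) - mu).
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum(x * (y - mu) for x, y, mu in zip(xs, ys, mus))
        # Negative Hessian (2 x 2), with weights mu.
        h00 = sum(mus)
        h01 = sum(mu * x for mu, x in zip(mus, xs))
        h11 = sum(mu * x * x for mu, x in zip(mus, xs))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:   # parameter updates have become negligible
            break
    return b0, b1


# Assumed toy counts that grow with x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 2, 2, 4, 6, 9]

b0, b1 = fit_poisson(xs, ys)
fitted = [math.exp(b0 + b1 * x) for x in xs]
print(b0, b1)
print(fitted)  # fitted means; always positive under the log link
```

With the log link the fitted means stay positive by construction, which is the practical advantage over the identity link.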

### Model Evaluation

• Assessing goodness-of-fit
• Same as others. See above.
• If the data are overdispersed, consider adjusting for it or using negative binomial regression instead.
• Poisson assumes mean = variance, which does not hold when over- or underdispersion is present.
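
One common way to check the mean = variance assumption (not covered in the notes above, so treat it as an added suggestion) is the Pearson dispersion statistic: the sum of $$(y_i - \mu_i)^2 / \mu_i$$ divided by the residual degrees of freedom. Values well above 1 suggest overdispersion. The sketch assumes an intercept-only Poisson model, whose fitted mean is simply the sample mean, and made-up counts.

```python
def pearson_dispersion(ys, mus, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(ys, mus))
    return chi2 / (len(ys) - n_params)


# Assumed toy counts with many zeros and a few large values.
counts = [0, 1, 0, 12, 3, 0, 9, 1, 0, 4]
mean_count = sum(counts) / len(counts)   # fitted mean of an intercept-only model

dispersion = pearson_dispersion(counts, [mean_count] * len(counts), n_params=1)
print(mean_count, dispersion)
```

For these counts the statistic is far above 1, which is the situation where a negative binomial model or a dispersion adjustment would be worth considering.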

## Polynomial Regression

### Overview

• It is a form of higher order regression modelled as an n-th degree polynomial.
• Quadratic regression : $$Y = m_1 X + m_2 X^2 + c$$
• Cubic regression : $$Y = m_1 X + m_2 X^2 + m_3 X^3 + c$$
• n-th degree regression : $$Y = m_1 X + m_2 X^2 + m_3 X^3 + \cdots + m_n X^n + c$$
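
Although the fitted curve is non-linear in $$X$$, the model is still linear in its coefficients, so it can be fitted by ordinary least squares on a design matrix with columns $$1, X, X^2, \ldots, X^n$$. The sketch below assumes this approach and solves the normal equation with a small hand-written Gaussian elimination to stay dependency-free; `solve` and `fit_polynomial` are hypothetical helpers, and the data are generated from a known quadratic.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):   # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns [c, m_1, ..., m_degree]."""
    design = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equation: (X^T X) beta = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Noise-free data from Y = 1 + 2X + 3X^2 (assumed for illustration).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

c, m1, m2 = fit_polynomial(xs, ys, degree=2)
print(c, m1, m2)  # recovers the generating coefficients up to rounding
```

For high degrees the columns of the design matrix become strongly correlated, so the normal equation becomes ill-conditioned and a more stable solver is usually preferable.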