by Chee Yee Lim

Posted on 2021-04-30

Collection of notes on model types, focusing on linear models from a statistical point of view.

- The general equation for a linear regression:
- \( y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \epsilon \), where \( \epsilon \) is the difference between predicted and observed values, assumed to be Gaussian distributed.
- The confidence intervals on the weights can be estimated by bootstrapping, i.e. resampling the data and repeatedly refitting the model (see the sketch below).
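
- A minimal sketch of the bootstrap approach (the toy data, true weights and number of resamples below are illustrative assumptions):

```python
# Bootstrapped confidence intervals for linear regression weights:
# resample the rows with replacement, refit, and take percentiles.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # toy predictors (assumed)
y = X @ np.array([1.5, -2.0, 0.3]) + rng.normal(scale=0.5, size=200)

boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))       # resample rows with replacement
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)

# 95% percentile intervals for each weight
lower, upper = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
print(lower, upper)
```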

- Linear regression can also be written in matrix form.
- \( \boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \), where \( \boldsymbol{X} = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{k,1} \\\ \vdots & \vdots & & \vdots \\\ 1 & x_{1,N} & \cdots & x_{k,N} \end{pmatrix} \), \( N \) is the number of observations, \( k \) is the number of variables, and the column of \( 1 \)s represents the intercept.
- \( \boldsymbol{\epsilon} \) has mean \( \boldsymbol{0} \) and variance \( \sigma^2 \boldsymbol{I} \).

- Advantages
- Powerful for their relative simplicity.
- Computationally cheap.

- Disadvantages
- By default, they only model a linear relationship between input and response variables.
- Require input preprocessing (e.g. encoding categorical variables, feature scaling).

- If there are many features to the model, it is possible to introduce sparsity into the model by using the following methods.
- Model regularisation using Lasso / L1-norm
- Stepwise model fitting
- Model fitting on dimensionally reduced data
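
- For example, a minimal sketch of Lasso-induced sparsity (the toy data with only two informative features out of twenty, and the penalty strength, are illustrative assumptions):

```python
# L1 regularisation drives the coefficients of weak features to exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))                       # 20 candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))   # typically close to 2
```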

- Multicollinearity will cause:
- Model instability
- Small changes in data lead to large changes in model (i.e. coefficients), making model interpretation difficult.
- Magnitude of model coefficients may not make sense.

- Numerical instability in model fitting
- The inverse of \( X^T X \) does not exist, due to linear dependency between columns in the matrix. Hence standard OLS cannot be used.
- An approximate inverse may or may not be constructed, depending on the solver algorithm used.

- However, if the goal is not to get interpretability and the fitting algorithm is not affected, then multicollinearity is not a concern.

- From a probabilistic (Bayesian) point of view, fitting a linear regression by least squares is equivalent to maximum likelihood estimation under a Gaussian likelihood.

- Linear regression relies on the following assumptions.

- Linearity
- Linear relationship between predictors and target variables.
- This is its greatest strength and greatest weakness.

- Normality
- Error terms are normally distributed.

- Homoscedasticity
- Constant variance of error terms (across all features).

- Independence
- Each data point/error term must be independent.
- Absence of multicollinearity - strongly correlated features distort the weight estimates.

- A linear regression is typically fitted by minimising the residual sum of squares (RSS).
- \( RSS(\beta) = \sum_{i=1}^N (y_i - \hat{y}_i)^2 = \sum_{i=1}^N (y_i - \beta^T x_i)^2 \)
- This is known as Ordinary Least Squares (OLS). This has a closed form solution.

- Least squares estimation can be written in matrix formulation.
- \( \boldsymbol{\hat{\beta}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y} \)
- This is sometimes known as the "normal equation".
- The estimated coefficients require the inversion of the matrix \( \boldsymbol{X}^T \boldsymbol{X} \). If \( \boldsymbol{X} \) is not of full column rank, then \( \boldsymbol{X}^T \boldsymbol{X} \) is singular and the model cannot be estimated this way; another way to solve for the coefficients is then needed.
- The residual variance is estimated using \( \hat{\sigma}_{e}^2 = \frac{1}{N-k-1} (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})^T (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}}) \), where \( N \) is the number of observations and \( k \) the number of predictors.
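
- A minimal numerical sketch of the normal equation and the residual variance estimate (the toy data and true coefficients are illustrative assumptions; \( \boldsymbol{X} \) includes a leading column of ones for the intercept):

```python
# Solve the normal equation (X^T X) beta = X^T y and estimate sigma^2.
import numpy as np

rng = np.random.default_rng(2)
N, k = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])   # intercept + k predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # (X^T X)^{-1} X^T y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - k - 1)             # residual variance estimate
print(beta_hat, sigma2_hat)
```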

- It can also be fitted by gradient descent.
- First define the loss function.
- \( E = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \)
- \( E = \frac{1}{n} \sum_{i=1}^n (y_i - (m x_i + c))^2 \)

- Calculate the partial derivative of the loss function wrt \( m \).
- \( D_m = \frac{1}{n} \sum_{i=1}^n 2 (y_i - (m x_i + c))(- x_i) \)
- \( D_m = \frac{-2}{n} \sum_{i=1}^n x_i (y_i - \hat{y}_i) \)

- Calculate the partial derivative of the loss function wrt \( c \).
- \( D_c = \frac{-2}{n} \sum_{i=1}^n (y_i - \hat{y}_i) \)

- Update \( m \) and \( c \) using \( D_m \) and \( D_c \) multiplied by the learning rate \( L \).
- \( m = m - L \times D_m \)
- \( c = c - L \times D_c \)
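
- A minimal sketch of these gradient descent updates for a single predictor (the toy data, learning rate and iteration count are illustrative assumptions):

```python
# Fit y = m*x + c by gradient descent using the D_m and D_c rules above.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=200)

m, c = 0.0, 0.0
L = 0.001                                            # learning rate
for _ in range(10_000):
    y_hat = m * x + c
    D_m = (-2 / len(x)) * np.sum(x * (y - y_hat))    # dE/dm
    D_c = (-2 / len(x)) * np.sum(y - y_hat)          # dE/dc
    m -= L * D_m
    c -= L * D_c

print(m, c)   # should approach the true slope 2.5 and intercept 1.0
```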


- The fundamental equation of generalised linear model is:
- \( g(E(y)) = \alpha + \beta x_1 + \gamma x_2 \), where \( g() \) is the link function, \( E(y) \) is the expectation of the target variable and \( \alpha + \beta x_1 + \gamma x_2 \) is the linear predictor ( \( \alpha, \beta, \gamma \) are to be estimated).
- The role of the link function is to 'link' the expectation of \( y \) to the linear predictor.

- The choice of link function and the choice of the random component's distribution are made separately (see the sketch below).
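
- A minimal sketch of this separation using statsmodels (the toy data and the Gaussian-family-with-log-link combination are illustrative assumptions):

```python
# The random component (family) and the link function are chosen independently.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 2, size=200)
y = np.exp(0.3 + 0.5 * x) + rng.normal(scale=0.2, size=200)  # positive, log-linear mean

X = sm.add_constant(x)
# Gaussian random component paired with a non-canonical log link;
# either choice can be swapped without changing the other.
model = sm.GLM(y, X, family=sm.families.Gaussian(link=sm.families.links.Log()))
print(model.fit().params)                            # roughly [0.3, 0.5]
```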

- Linearity
- GLM does not assume a linear relationship between dependent and independent variables.
- However, it assumes a linear relationship between the link-transformed expectation \( g(E(y)) \) and the independent variables.

- Normality
- The dependent variable need not be normally distributed.

- Independence
- Errors need to be independent but need not be normally distributed.

- Most other common assumptions of linear models are due to the use of OLS.
- E.g. linearity, normality, homoscedasticity and measurement level.

- Regularisation is a method for adding additional constraints or penalty to a model, with the goal of preventing overfitting and improving generalisation.
- Instead of minimising a loss function \( E(X,Y) \), the loss function to minimise becomes \( E(X,Y) + \alpha | w | \), where \( w \) is the vector of model coefficients, \( | \cdot | \) is typically L1 or L2 norm and \( \alpha \) is a tunable free parameter, specifying the amount of regularisation (so \( \alpha = 0 \) implies an unregularised model).

- L1 norm / LASSO (sum of absolute values of the betas)
- Since each non-zero coefficient adds to the penalty, it forces weak features to have zero as coefficients.
- Thus, L1 regularisation produces sparse solutions, inherently performing feature selection.

- L2 norm / ridge (sum of squared betas)
- Since the coefficients are squared in the penalty expression, it has a different effect from L1-norm, namely it forces the coefficient values to be spread out more equally.
- The effect of this is that models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularised or L1 models).
- In the Bayesian view, the L2 penalty is equivalent to placing a Gaussian prior on the coefficients (and the L1 penalty to a Laplace prior).
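
- A minimal sketch contrasting the two penalties on a pair of highly correlated features (the toy data and penalty strengths are illustrative assumptions):

```python
# L1 tends to keep one of two nearly identical features; L2 spreads the weight.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.01, size=300)           # nearly identical to x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=300)

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)    # typically one coefficient near 2, the other 0
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)    # typically both coefficients near 1
```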

- GLMs do not use OLS (ordinary least squares) for parameter estimation. Instead, they use maximum likelihood estimation (MLE).
- If applied to linear regression (with Gaussian errors), MLE gives the same estimates as OLS.
- MLE is a more general technique that can be applied to many models.

- MLE is about estimating the parameters when we know the data and we assume a distribution.
- Likelihood is a measure of fit between some observed data and the population parameters.
- A higher likelihood implies a better fit between the observed data and the parameters.
- The goal of MLE is to find the population parameters that are most likely to have generated the observed data.

- A likelihood function looks just like a probability density function.
- The only difference is that in a likelihood function the data are taken as given in order to estimate the parameters, while in a probability density function the parameters are given in order to predict the data.
- \( L(\theta | x) = f(x | \theta) \)
- e.g. for Bernoulli distribution:
- \( P(x | p) = p^x (1 - p)^{1-x} \)
- \( L(p | x,n) = \prod_{i=1}^n p^{x_i} (1 - p)^{1-x_i} \)

- As likelihood gives extremely small numbers that create rounding problems, we take the logarithm of the likelihood function, i.e. the log-likelihood.
- Since likelihoods are always between 0 and 1, log-likelihoods are always negative.
- E.g. for Bernoulli distribution:
- \( l(p | x,n) = \sum_{i=1}^n x_i \log(p) + (1 - x_i) \log(1-p) \)
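
- A minimal sketch of maximising this Bernoulli log-likelihood numerically (the simulated 0/1 data and the use of scipy's bounded scalar optimiser are illustrative assumptions):

```python
# Minimise the negative log-likelihood; the optimum matches the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.3, size=500)                   # observed 0/1 data

def neg_log_likelihood(p):
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                               # the two values agree
```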

- To find the parameter that corresponds to maximum log-likelihood, we need to look at the first derivative of log-likelihood (corresponding to gradient).
- Solve for the unknown parameter by setting the derivative to 0.
- The resulting value is the maximum likelihood estimate.
- This step is usually done iteratively, by first choosing arbitrary starting parameter values, then updating them by evaluating the vector of partial derivatives of the log-likelihood function.
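
- E.g. for the Bernoulli log-likelihood above, setting the derivative to zero gives a closed-form solution:
- \( \frac{\partial l}{\partial p} = \sum_{i=1}^n \left( \frac{x_i}{p} - \frac{1 - x_i}{1-p} \right) = 0 \)
- \( \Rightarrow \hat{p} = \frac{1}{n} \sum_{i=1}^n x_i \), i.e. the sample mean.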

- Then we look at the second derivative of log-likelihood (corresponding to valley/peak).
- A negative value confirms the function is concave at that point, i.e. we have reached a maximum.
- A positive value instead indicates the function is convex at that point, i.e. we have reached a minimum.

- Second derivative of log-likelihood is also used to compute the standard errors.
- The second derivative is a measure of curvature of a function. The steeper the curve, the more certain we are about our estimates.
- The matrix of second derivatives is called the Hessian.
- The inverse of the negative Hessian (the observed information matrix) is the variance-covariance matrix of the estimates.
- The standard errors of MLE are the square root of the diagonal entries (i.e. variance) of this matrix.
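
- E.g. continuing the Bernoulli example, \( l''(\hat{p}) = -\frac{n}{\hat{p}(1-\hat{p})} \), so the standard error of the estimate is \( SE(\hat{p}) = \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} } \).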

- The link function of a logistic regression is a logit function.
- The response Y (random component) of logistic regression has Binomial distribution.
- The general equation of a logistic regression is:
- Given a linear equation,
- \( \hat{y} = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p \)

- Logistic regression (model output) is defined by,
- \( P(y = 1) = \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_1 + ... + \beta_p x_p))} \)

- Logistic regression is a linear model for the log odds.
- \( \log ( \frac{ P(y=1) }{ 1 - P(y=1) } ) = \log ( \frac{ P(y=1) }{ P(y=0) } ) = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p \)
- \( \text{odds} = \frac{ \text{probability of event} }{ \text{probability of no event} } \)

- The logistic function (the inverse of the logit) is defined by:
- \( logistic(\eta) = \frac{1}{ 1 + \exp (- \eta) } \)

- The implicit classification threshold of a logistic regression is 0.5, i.e. predict class 1 when \( P(y = 1) > 0.5 \) (see the sketch below).
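
- A minimal sketch with scikit-learn (the toy data are an illustrative assumption; note that the default solver applies a small L2 penalty):

```python
# Check that the fitted model is linear in the log odds.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 2))
true_logit = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]
log_odds = np.log(p / (1 - p))
# The log odds reproduce the linear predictor beta_0 + beta^T x ...
print(np.allclose(log_odds, clf.intercept_ + X @ clf.coef_.ravel()))
# ... and predict() labels a point as class 1 whenever p > 0.5.
print(np.array_equal(clf.predict(X), (p > 0.5).astype(int)))
```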


- The model is fitted by maximum likelihood estimation (not OLS), starting from initial parameter estimates.
- The estimates are updated iteratively until the log-likelihood no longer changes significantly.

- Does not assume linear relationship between dependent and independent variables.
- But assumes linearity between log odds and independent variables.

- Error terms should be independent but need not be normally distributed.
- Homoscedasticity (i.e. constant variance of error terms) is not assumed.
- Observations must be independent.
- Should not have multicollinearity, i.e. strong correlations among the independent variables.

- Poisson regression is a type of GLM used for modelling counts.
- In Poisson regression, the response \( Y \) (random component) has a Poisson distribution, i.e. \( y_i \sim \text{Poisson}(\mu_i) \) for \( i = 1, \ldots, N \).
- In most cases, the link function for Poisson regression is the natural log link, \( \log(\mu) = \beta_0 + \beta_1 x_1 \).
- Sometimes, the identity link is used instead \( \mu = \beta_0 + \beta_1 x_1 \). Note that this can give \( \mu < 0 \).

- It is fitted by maximum likelihood estimation, i.e. finding values that maximises log-likelihood.
- There is no closed-form solution, so the MLE is obtained using iterative algorithms such as Newton-Raphson or iteratively reweighted least squares (IRWLS).
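
- A minimal sketch of Poisson regression with statsmodels (the toy count data are an illustrative assumption):

```python
# Poisson GLM with the default (natural log) link, fitted iteratively by MLE.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 2, size=300)
y = rng.poisson(np.exp(0.4 + 0.9 * x))               # counts with a log-linear mean

X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)                                 # close to [0.4, 0.9]
```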

- Assessing goodness-of-fit
- Same as others. See above.
- On overdispersion in data, can consider adjusting for it or use negative binomial regression instead.
- Poisson regression assumes mean = variance, which does not hold when over- or underdispersion is present.

- Polynomial regression is a form of higher-order regression in which the target is modelled as an n-th degree polynomial of the input.
- E.g. quadratic, cubic regressions.
- Quadratic regression : \( Y = m_1 X + m_2 X^2 + c \)
- Cubic regression : \( Y = m_1 X + m_2 X^2 + m_3 X^3 + c \)
- n-th degree regression : \( Y = m_1 X + m_2 X^2 + m_3 X^3 + ... + m_n X^n + c \)
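
- A minimal sketch of quadratic regression (the toy data are an illustrative assumption):

```python
# Fit Y = m_2 X^2 + m_1 X + c by least squares on polynomial terms.
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x**2 - 1.0 * x + 2.0 + rng.normal(scale=0.2, size=200)

coeffs = np.polyfit(x, y, deg=2)                     # returns [m_2, m_1, c], highest power first
print(coeffs)                                        # close to [0.5, -1.0, 2.0]
```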