by Chee Yee Lim

Posted on 2021-05-01

A collection of notes on model types, focusing on traditional machine learning models.

- Key concepts
- k-nearest neighbours (kNN) works by classifying a data point based on the majority class among its k nearest neighbours.
- The key drawback of kNN is the cost of searching for the nearest neighbours of each sample.
- It is important to (1) choose an odd k value for a two-class problem, and (2) ensure k is not a multiple of the number of classes.
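As a minimal from-scratch sketch (the function name and toy data are my own, not from these notes), the majority-vote rule can be written as:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). Note the cost: every prediction scans the
    whole training set, which is kNN's key drawback."""
    dists = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    votes = [label for _, label in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

# Two well-separated clusters; a query near the first cluster gets its label.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # -> a
```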

- Key concepts
- The key objective of SVM is to draw a hyperplane (i.e. a decision surface) that optimally separates the two classes, such that the margin between the hyperplane and the observations is maximised.
- The data points closest to the hyperplane are the support vectors.
- Only the support vectors (which are a small subset of the full dataset) are used to specify the decision function (and hence the decision surface).

- A hyperplane is an \( n-1 \) dimensional subspace of an \( n \) dimensional space.
- By default, SVM can only draw linear decision boundaries. However, there are 2 ways to modify SVM to handle non-linearly separable cases (which form 2 key parameters of SVM).
- Soft margin.
- Finds a hyperplane that tolerates a few misclassified data points.
- The degree of tolerance is parameterised by a penalty term.
- The higher the penalty term, the narrower the margin and the fewer support vectors the SVM depends on.
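The role of the penalty term can be seen in a bare-bones subgradient-descent sketch of the soft-margin objective (an illustrative implementation under my own naming, not a production solver; real libraries solve this far more carefully):

```python
def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the soft-margin objective
    0.5*||w||^2 + C * sum(max(0, 1 - y_i * (w.x_i + b))), with y_i in {-1, +1}.
    A larger C penalises margin violations more heavily (narrower margin)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active: the point violates the margin
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # only the regulariser contributes
                w = [wj * (1 - lr) for wj in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Two linearly separable clusters around (2.5, 2.5) and (-2.5, -2.5).
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_soft_margin_svm(X, y)
print([svm_predict(w, b, x) for x in X])  # all training points classified correctly
```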

- Kernel trick.
- Finds a non-linear decision boundary by using a non-linear kernel, such as the polynomial or radial basis function (RBF) kernel.
- The idea is to gain linear separation by transforming the data to a higher dimensional space.

- Advantages
- Less sensitive to outliers, because it relies only on the points closest to the decision boundary (the support vectors).

- Key concepts
- Tree-based methods are one of the few universally useful and applicable families of methods.

- Advantages
- Can handle non-linear relationships and feature interactions, and does not require feature transformation.
- Requires relatively little input data preprocessing.
- Robust to outliers.
- Easy to interpret (for decision tree).

- Key concepts
- A decision tree iteratively chooses splitting criteria that binary-split the data into subsets, and continues until no further splits can be made.
- It essentially produces a set of rules that iteratively partition the data into subsets that are similar in nature.

- Advantages
- Decision trees are very interpretable (as long as the depth is low).

- Disadvantages
- Trees (in non-ensemble setting) tend to overfit (generalize poorly).
- Trees cannot deal with linear relationships due to the use of splits, which create step functions.
- A small change in the magnitudes of inputs can result in a big change in predictions, due to the value crossing the step function threshold.
- Trees are unstable, because the fitted tree structure can change drastically if the data change.

- CART algorithm
- The most popular algorithm for tree induction is the classification and regression trees (CART) algorithm.
- At each node, CART picks a feature that minimizes the Gini index of the class distribution for classification problems; and the variance for regression problems.
- The Gini index tells us how "impure" a node is, e.g. if all classes occur with equal frequency, the node is maximally impure; if only one class is present, it is maximally pure.
- Only one feature with a corresponding threshold is used at each split.
- The search and split will continue greedily and recursively until a stop criterion is reached.
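The greedy search over one feature can be sketched in a few lines (a from-scratch illustration; the function names are my own):

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_c^2). 0 for a pure node; maximal when
    all classes occur with equal frequency."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """CART-style greedy search over a single feature: pick the threshold
    that minimises the weighted Gini impurity of the two children."""
    best_score, best_t = None, None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best_score is None or score < best_score:
            best_score, best_t = score, t
    return best_t, best_score

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # -> (3, 0.0): splitting at x <= 3 gives two pure children
```

A full CART implementation repeats this search over every feature at every node, recursing until a stopping criterion fires.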

- The stopping criterion is very important as it influences the performance of the tree.
- Pruning can be used after learning the tree to further improve performance.

- Feature importance
- It is calculated by going through all splits that use each feature and measuring how much each split reduced the variance or Gini index compared to the parent node.
- The importances are scaled so that they sum to 100.
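That aggregation can be sketched as follows (a hypothetical helper of my own; `split_gains` pairs each split's feature with the impurity decrease it achieved):

```python
def feature_importance(split_gains):
    """split_gains: (feature, impurity_decrease) pairs, one per split.
    Sum the decreases per feature, then scale the totals to sum to 100."""
    totals = {}
    for feat, gain in split_gains:
        totals[feat] = totals.get(feat, 0.0) + gain
    grand_total = sum(totals.values())
    return {f: 100.0 * g / grand_total for f, g in totals.items()}

# Feature "a" contributed 0.4 of the 0.5 total impurity reduction.
print(feature_importance([("a", 0.3), ("b", 0.1), ("a", 0.1)]))  # -> {'a': 80.0, 'b': 20.0}
```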

- Key concepts
- A random forest is simply a set of decision trees in parallel, trained using random bootstrapped data (per tree) and random feature selection (per node).
- A random forest is a type of bagging technique.
- Bagging involves training each individual learner/estimator on different bootstrapped subsets of the data, and then averaging the predictions.

- Each decision tree is trained on a subset of the data, bootstrapped (sampling with replacement).
- Each decision tree gets all features, but each split will only consider some of the features. In the end, each split uses only one feature that gives the best score (e.g. minimise Gini impurity).
- A criterion is used to guide the splits, commonly being Gini index.
- The final prediction is made by averaging the predictions of each individual tree.
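The two sources of randomness can be sketched as follows (illustrative helpers of my own, not a full forest implementation):

```python
import random

def bootstrap_sample(X, y, rng):
    """Bagging: sample len(X) rows with replacement to train one tree."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_feature_subset(n_features, rng):
    """Per-split feature subsampling; sqrt(n_features) is a common default."""
    k = max(1, round(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
Xb, yb = bootstrap_sample([[1], [2], [3], [4]], [0, 0, 1, 1], rng)
print(len(Xb))                        # same size as the original data
print(random_feature_subset(9, rng))  # 3 of the 9 feature indices
```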

- Advantages
- Have high accuracy and low variance even with very little hyperparameter tuning.
- Very scalable - computation is easy to parallelise in both training and testing phases.

- Disadvantages
- Black box - it is difficult to interpret the combined predictions of many deep trees.

- Key parameters to tune
- Number of estimators - the higher, the better, until performance saturates.
- Number of variables considered at each node split - more variables make individual trees stronger but also more correlated with each other; the square root of the number of features is a common default.
- Max depth of trees - deeper trees capture more complexity, but will definitely overfit if too deep.
- Minimum number of samples at each leaf node - smaller values give a finer fit. Will overfit if too small.

- There are 3 ways to prevent overfitting a random forest model.
- Limit the max depth of a tree.
- Prune the trees after model training.
- Stop growing a tree early if the objective function stops improving by a specified margin.