- Supervised learning makes predictions using a training dataset with known labelled outcomes.
- Supervised learning consists of three major steps: training, testing, and prediction.
- It can be categorized into classification or regression.
- Typical workflow of supervised learning:

```mermaid
graph LR
    subgraph Prediction
        I[New Data] --> J[Data Preprocessing]
        J --> K[Trained Model]
        K --> L[Generate Predictions]
        L --> M[Validation]
    end
    subgraph TV["Training & Validation"]
        A[Historical Data] --> B[Data Preprocessing]
        B --> C[Random Sampling]
        C -->|70%| D[Train Data]
        C -->|30%| E[Test Data]
        D --> F[Train Model]
        F --> G[Generate Predictions]
        E --> G
        G --> H[Validation]
    end
```
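The training-and-validation workflow above can be sketched end to end in plain Python. The 70/30 split matches the diagram; the majority-class "model" is a placeholder stand-in for a real learner, and the synthetic data is made up for illustration.

```python
# Sketch of the supervised workflow: preprocess -> random 70/30 split ->
# train -> predict on held-out data -> validate with accuracy.
import random

random.seed(0)

# Historical data: (feature, label) pairs -- a stand-in for preprocessing output.
data = [(x, int(x > 5)) for x in range(100)]

# Random sampling into 70% train / 30% test.
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# "Train" a placeholder model: always predict the majority label seen in training.
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)

# Generate predictions on the held-out test set and validate with accuracy.
predictions = [majority for _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```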
- List of algorithms
- Linear regression
- Linear method.
- Regression.
- Works by predicting the linear relationships between independent variables and a dependent variable.
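As an illustration, one-feature linear regression can be fitted in closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The toy data below is made up.

```python
# Closed-form simple linear regression y = a + b*x (ordinary least squares).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = covariance(x, y) / variance(x); intercept a from the means.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

predict = lambda x: a + b * x
print(round(b, 2), round(a, 2))  # → 1.99 0.05
```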
- Survival regression
- Linear method.
- Regression.
- Used to predict the time until a specific event occurs.
- The main difference from linear regression is its ability to handle censoring, a missing-data problem where the time to event is not observed for every subject.
- Generalised linear model (GLM)
- Linear method (linear in its parameters, though the link function may be non-linear).
- Regression.
- Uses a possibly non-linear link function to allow the modelling of non-Gaussian distributed outcomes.
- Generalized additive model (GAM)
- Non-linear method.
- Regression.
- Assumes the outcome can be modeled by a sum of arbitrary functions of each variable.
- These non-linear functions are typically smooth functions represented by splines.
- Logistic regression
- Linear method.
- Binary classification.
- Works by mapping the output of a linear model into probabilities; it is a GLM with a logistic (sigmoid) link function.
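A minimal sketch of the logistic link: a linear score w·x + b is squashed into a probability in (0, 1) by the sigmoid, then thresholded for a binary class. The weights and input below are hypothetical, not learned.

```python
# Logistic regression inference: linear score -> sigmoid -> probability -> class.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = [1.5, -2.0], 0.5   # hypothetical learned weights and bias
x = [2.0, 1.0]            # one input point

score = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear part: w·x + b
prob = sigmoid(score)                             # link maps score to probability
label = int(prob >= 0.5)                          # threshold for binary class
print(round(prob, 3), label)  # → 0.818 1
```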
- K-nearest neighbours
- Non-linear, non-parametric method.
- Multi-class classification and regression.
- For classification, works by classifying a point based on majority vote of k-nearest neighbours.
- For regression, works by taking the average outcome of the k-nearest neighbours.
- Important parameters to be decided are the right k and the distance measure between data points.
- Note that it is different from k-means clustering (unsupervised).
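A minimal k-NN classifier sketch using Euclidean distance and majority vote; the toy training set and the choice of k = 3 are made up for illustration.

```python
# k-nearest neighbours: sort training points by distance to the query,
# then take a majority vote among the k closest labels.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a point as a tuple."""
    dist = lambda p, q: math.dist(p, q)  # Euclidean; other metrics also possible
    neighbours = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]    # majority vote among k nearest

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5), k=3))  # → a
```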
- Naïve Bayes
- Linear method.
- Multi-class classification.
- Works by finding the probability that a point belongs to a class for each feature separately.
- It naively assumes that the features in a dataset are independent (i.e. conditional independence of features), ignoring any possible correlations between features.
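The conditional-independence assumption can be shown on a toy example: per-class likelihoods are estimated for each feature separately and simply multiplied together with the class prior. The (weather, temperature) → play dataset below is made up.

```python
# Naive Bayes scoring: prior(class) * P(feature1 | class) * P(feature2 | class).
# Multiplying per-feature likelihoods is the "naive" independence step.
data = [
    ("sunny", "hot", "no"), ("sunny", "mild", "no"), ("rain", "mild", "yes"),
    ("rain", "cool", "yes"), ("overcast", "hot", "yes"), ("sunny", "cool", "yes"),
]

def naive_bayes_score(weather, temp, cls):
    rows = [r for r in data if r[2] == cls]
    prior = len(rows) / len(data)
    # Likelihoods estimated independently per feature (the naive assumption).
    p_weather = sum(r[0] == weather for r in rows) / len(rows)
    p_temp = sum(r[1] == temp for r in rows) / len(rows)
    return prior * p_weather * p_temp

scores = {c: naive_bayes_score("rain", "mild", c) for c in ("yes", "no")}
print(max(scores, key=scores.get))  # → yes
```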
- Support vector machine
- Linear or non-linear method, depending on the kernel used.
- Binary classification.
- Works by finding the optimal hyperplane that maximises the margin between different classes.
- The data points closest to the classification boundary are known as support vectors.
- Training time scales roughly quadratically with the number of samples, so it is not practical beyond tens of thousands of samples.
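To illustrate support vectors: given a toy 2-D dataset and a hand-picked separating hyperplane (not one fitted by an SVM solver), the support vectors are simply the points with the smallest geometric margin to the boundary.

```python
# For hyperplane w·x + b = 0, the geometric margin of a point is
# |w·x + b| / ||w||; support vectors are the points closest to the boundary.
import math

points = [((1, 1), -1), ((2, 0), -1), ((3, 3), 1), ((4, 4), 1)]
w, b = (1.0, 1.0), -4.0   # hand-picked hyperplane x + y = 4 for illustration

norm = math.hypot(*w)

def margin(p):
    # Geometric distance from point p to the hyperplane.
    return abs(w[0] * p[0] + w[1] * p[1] + b) / norm

margins = {p: margin(p) for p, _ in points}
closest = min(margins.values())
support_vectors = [p for p, m in margins.items() if math.isclose(m, closest)]
print(support_vectors)  # → [(1, 1), (2, 0), (3, 3)]
```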
- Decision trees
- Non-linear method.
- Multi-class classification and regression.
- Works by learning the best decision rules to separate data based on the variables.
- Can work with both continuous and categorical data directly.
- Tree structure is ideal for capturing interactions between features and non-linearity in data.
- There is no need to transform features. (tree structure will work with any monotonic transformation of a feature)
- Should not one-hot encode categorical data, as this worsens model performance (each additional dummy variable requires a deeper tree).
- Tend to overfit.
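How a tree learns a decision rule can be sketched with a depth-1 tree (a decision stump): scan candidate thresholds on a single feature and keep the split that classifies the data best. The toy data and the accuracy criterion (real trees typically use Gini impurity or entropy) are illustrative choices.

```python
# Decision stump: evaluate every candidate threshold and keep the best split.
xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]

def stump_accuracy(threshold):
    preds = [int(x > threshold) for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Candidate thresholds: midpoints between consecutive sorted feature values.
candidates = [(a + b) / 2 for a, b in zip(sorted(xs), sorted(xs)[1:])]
best = max(candidates, key=stump_accuracy)
print(best, stump_accuracy(best))  # → 6.5 1.0
```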
- Random forest
- Non-linear method.
- Multi-class classification and regression.
- Works by using an ensemble (bagging) of decision trees trained in parallel and averaging their final predictions.
- Gradient-boosted trees
- Non-linear method.
- Multi-class classification and regression.
- Works by using an ensemble (boosting) of decision trees sequentially, which learns from the residuals of preceding trees to get more accurate results.
- XGBoost
- Non-linear method.
- Multi-class classification and regression.
- One of the dominant ML techniques for a long time.
- Offers fast computation and supports out-of-core (larger-than-memory) training.
- LightGBM
- Non-linear method.
- Multi-class classification and regression.
- A newer challenger with better performance than XGBoost.
- LightGBM is faster and often more accurate because it grows trees leaf-wise (best-first) rather than level-wise.
- Unsupervised learning finds hidden patterns and structure in the dataset without the aid of labeled responses.
- Unsupervised learning is ideal when you only have access to input data and training data is unavailable or hard to obtain.
- It can be categorized into clustering and dimension reduction.
- Assuming classes are unknown, clustering groups unlabeled observations that share similar properties.
- Dimension reduction simplifies high-dimensional data by mapping it to a lower-dimensional space; in other words, it reduces the number of variables by finding a subspace that preserves most of the information in the original high-dimensional space.
- List of use cases for unsupervised learning
- Anomaly detection
- Anomaly detection identifies rare observations that deviate significantly from the majority of the data.
- Example algorithms: isolation forest (based on random forest) and one-class SVM (data points not belonging to the one class are outliers).
- Topic modelling
- In NLP, topic modelling is a form of dimension reduction aiming to find out the topics in a group of documents.
- Example algorithm: LDA (Latent Dirichlet Allocation).
- Recommendations
- Recommendation algorithms provide personalised recommendations based on customer behaviour.
- Example algorithms: collaborative filtering, content-based filtering, and association rule learning (for market basket analysis).
- List of clustering algorithms
- k-means clustering
- Linear method (can be made non-linear with a kernel).
- Works by iteratively minimising the distance of each data point to the nearest assigned centroid.
- Different distance metrics can be used, such as Euclidean and correlation.
- The downsides are that each data point can only belong to one cluster (hard assignment), and solutions may get stuck in a local optimum due to random initialisation.
- k-means uses an expectation-maximisation-style approach to cluster points: assign each point to its nearest centroid, then update each centroid to the mean of its assigned points.
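The two alternating steps can be sketched in one dimension: assign points to the nearest centroid, move each centroid to the mean of its cluster, and repeat until the centroids stop changing. The points and the deliberately poor initialisation are made up.

```python
# Minimal 1-D k-means: alternate assignment (E-step) and mean update (M-step).
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [0.0, 5.0]   # deliberately poor initialisation

for _ in range(100):
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    if new_centroids == centroids:   # converged: assignments stopped changing
        break
    centroids = new_centroids

print(centroids)  # → [1.5, 10.5]
```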
- List of dimension reduction algorithms
- PCA (Principal Component Analysis)
- Works by finding a set of linearly uncorrelated features on a low-dimensional subspace while preserving most of the variance in the data.
- LDA (Linear Discriminant Analysis)
- Supervised method that finds linear combinations of features which best separate known classes; the projections can serve as lower-dimensional features.
- SVD (Singular Value Decomposition)
- Factorises a data matrix into U Σ Vᵀ; keeping only the largest singular values gives a low-rank approximation of the data.
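A PCA sketch built on SVD (assuming numpy is available): centre the data, factorise it, and project onto the leading principal component. The toy data, which lies mostly along the diagonal, is made up.

```python
# PCA via SVD: centre, factorise, read off principal directions from Vt,
# and project the data onto the first component.
import numpy as np

X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
Xc = X - X.mean(axis=0)             # centre each feature at zero

# Rows of Vt are principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)     # fraction of variance per component

Z = Xc @ Vt[0]                      # 1-D projection onto the first component
print(explained.round(3))           # first component dominates for this data
```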
- If pre-existing models or frameworks exist for similar projects, then start with the same models.
- These models have been experimented with and shown to work for similar problems.
- Can be used as a baseline model at the very least if they do not work as well as intended.
- If there are data on expected outcomes, then use supervised machine learning methods.
- If there is no data on expected outcomes, then use (1) non-machine learning algorithms, (2) pre-trained models or (3) unsupervised machine learning methods.
- If the expected output is continuous, then use a regression approach.
- If the expected output is categorical, then use a classification approach.