# Data Science Overview

by Chee Yee Lim

Posted on 2021-03-10. A collection of notes giving a high-level overview of general data science concepts.

## Major Classes of Machine Learning Techniques

### Supervised Learning

• Supervised learning makes predictions using a training dataset with known labelled outcomes.
• Supervised learning consists of 3 major steps: training, testing and prediction.
• It can be categorized into classification or regression.
• Typical workflow of supervised learning:

```mermaid
graph LR
  subgraph P["Prediction"]
    I[New Data] --> J[Data Preprocessing]
    J --> K[Trained Model]
    K --> L[Generate Predictions]
    L --> M[Validation]
  end
  subgraph T["Training & Validation"]
    A[Historical Data] --> B[Data Preprocessing]
    B --> C[Random Sampling]
    C -->|70%| D[Train Data]
    C -->|30%| E[Test Data]
    D --> F[Train Model]
    F --> G[Generate Predictions]
    E --> G
    G --> H[Validation]
  end
```
• List of algorithms
  • Linear regression
    • Linear method.
    • Regression.
    • Works by modelling a linear relationship between the independent variables and a dependent variable.
  • Survival regression
    • Linear method.
    • Regression.
    • Used to predict the time until a specific event happens.
    • The main difference from linear regression is its ability to handle censoring, a type of missing-data problem where the time to event is not known for some observations.
  • Generalised linear model (GLM)
    • Non-linear method.
    • Regression.
    • Uses a (possibly non-linear) link function to allow the modelling of non-Gaussian distributed outcomes.
  • Generalised additive model (GAM)
    • Non-linear method.
    • Regression.
    • Assumes the outcome can be modelled by a sum of arbitrary smooth functions of each variable.
    • These non-linear functions are called splines.
  • Logistic regression
    • Linear method.
    • Binary classification.
    • Works by mapping outputs into probabilities using a GLM with a logit link.
  • K-nearest neighbours
    • Non-parametric, non-linear method.
    • Regression and multi-class classification.
    • For classification, works by assigning a point the majority-vote class of its k nearest neighbours.
    • For regression, works by taking the average outcome of the k nearest neighbours.
    • Important parameters to decide are the value of k and the distance measure between data points.
    • Note that it is different from k-means clustering (which is unsupervised).
  • Naïve Bayes
    • Linear method.
    • Multi-class classification.
    • Works by computing the probability that a point belongs to a class for each feature separately.
    • It naively assumes that the features in a dataset are independent (i.e. conditional independence of features), ignoring any possible correlations between features.
  • Support vector machine
    • Linear or non-linear method, depending on the kernel used.
    • Binary classification.
    • Works by finding the optimal hyperplane that maximises the margin between the classes.
    • The data points closest to the classification boundary are known as support vectors.
    • Computational complexity scales roughly quadratically with the number of samples, so it is not suitable beyond tens of thousands of samples.
  • Decision trees
    • Non-linear method.
    • Multi-class classification and regression.
    • Works by learning the best decision rules to separate the data based on the variables.
    • Can work with both continuous and categorical data directly.
    • The tree structure is ideal for capturing interactions between features and non-linearity in the data.
    • There is no need to transform features (a tree produces the same splits under any monotonic transformation of a feature).
    • Categorical data should not be one-hot encoded, as this worsens model performance (each additional variable requires a deeper tree).
    • Tends to overfit.
  • Random forest
    • Non-linear method.
    • Multi-class classification and regression.
    • Works by using an ensemble (bagging) of decision trees trained in parallel and averaging their final predictions.
  • Gradient-boosted trees
    • Non-linear method.
    • Multi-class classification and regression.
    • Works by training an ensemble (boosting) of decision trees sequentially, where each tree learns from the residuals of the preceding trees to get more accurate results.
  • XGBoost
    • Non-linear method.
    • Multi-class classification and regression.
    • One of the dominant ML techniques for a long time.
    • Offers better computation speed and supports out-of-core (not in-memory) computation.
  • LightGBM
    • Non-linear method.
    • Multi-class classification and regression.
    • A newer challenger to XGBoost with better performance.
    • LightGBM gains its performance from growing trees leaf-wise rather than level-wise.
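As a minimal sketch of the supervised workflow described above (preprocess, random 70/30 split, train, predict, validate), assuming scikit-learn is installed; the dataset is synthetic and all names are illustrative:

```python
# Minimal supervised-learning workflow: split, fit, predict, validate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Historical data with known labelled outcomes (synthetic here).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Random sampling: 70% train data, 30% test data, as in the diagram.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)      # training step

y_pred = model.predict(X_test)   # prediction step
print(f"test accuracy: {accuracy_score(y_test, y_pred):.3f}")  # validation step
```

The same skeleton works for any of the classifiers listed above; only the model line changes.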

### Unsupervised Learning

• Unsupervised learning finds hidden patterns and structure in the dataset without the aid of labeled responses.
• Unsupervised learning is ideal when only input data is available and labelled training data is unavailable or hard to obtain.
• It can be categorized into clustering and dimension reduction.
• Assuming classes are unknown, clustering groups unlabeled observations that share similar properties.
• Dimension reduction simplifies data with high dimensions by mapping them to a lower dimensional space. This also means it reduces the number of variables by finding a subspace that preserves the most information in the current high dimensional space.
• List of use cases for unsupervised learning
  • Anomaly detection
    • Anomaly detection identifies rare observations that deviate significantly from the majority of the data.
    • E.g. isolation forest (based on random forest) and one-class SVM (data points not belonging to the one class are outliers).
  • Topic modelling
    • In NLP, topic modelling is a form of dimension reduction that aims to find the topics in a group of documents.
    • E.g. LDA (Latent Dirichlet Allocation).
  • Recommendations
    • Recommendation algorithms provide personalised recommendations based on customer behaviour.
    • E.g. collaborative filtering, content-based filtering and association rule learning (for market basket analysis).
• List of clustering algorithms
  • k-means clustering
    • Linear method. (Can be kernelised.)
    • Works by iteratively minimising the distance of each data point to its nearest assigned centroid.
    • Different distance metrics can be used, such as Euclidean distance and correlation.
    • The downside is that each data point can only belong to one cluster, and solutions may get stuck in a local optimum due to random initialisation.
    • k-means uses the expectation-maximisation approach to cluster points.
• List of dimension reduction algorithms
  • PCA (Principal Component Analysis)
    • Works by finding a set of linearly uncorrelated features on a low-dimensional subspace while preserving most of the variance in the data.
  • LDA (Linear Discriminant Analysis)
    • A supervised method that finds a low-dimensional projection maximising the separation between known classes.
  • SVD (Singular Value Decomposition)
    • A matrix factorisation that underlies PCA; it decomposes a data matrix into singular vectors and singular values.
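The two categories above can be sketched with NumPy alone: k-means clustering via its assign/update (expectation-maximisation-style) loop, and PCA via the SVD of centred data. The blob data and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two well-separated blobs in 5 dimensions.
X = np.vstack([rng.normal(0.0, 0.5, (100, 5)),
               rng.normal(5.0, 0.5, (100, 5))])

# --- k-means: assign each point to its nearest centroid,
# --- then move each centroid to the mean of its assigned points.
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # assignment step
    centroids = np.array([X[labels == j].mean(axis=0)  # update step
                          if np.any(labels == j) else centroids[j]
                          for j in range(k)])

# --- PCA: project onto the directions of maximum variance via SVD.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_2d = Xc @ Vt[:2].T  # keep the top 2 principal components
print(labels[:5], X_2d.shape)
```

Library implementations (e.g. scikit-learn's `KMeans` and `PCA`) add refinements such as smarter initialisation, but the core loop and projection are as above.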

### Reinforcement Learning

• The objective of reinforcement learning is to map situations to actions that yield the maximum final reward.
• While choosing an action, the algorithm should consider not just the immediate reward but also all subsequent rewards.
• List of algorithms
  • Markov decision process
  • Q-learning
  • Temporal difference methods
  • Monte-Carlo methods
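A minimal Q-learning sketch makes the "immediate plus discounted future rewards" idea concrete. The environment here is a hypothetical 5-state corridor (not from the notes above) where only the rightmost state pays a reward:

```python
import random

random.seed(0)

# Toy MDP: a 5-state corridor. Action 0 moves left, action 1 moves right.
# Reaching the rightmost state gives reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def greedy(state):
    # Break ties randomly so the untrained agent still moves around.
    best = max(Q[state])
    return random.choice([a for a in ACTIONS if Q[state][a] == best])

Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

for _ in range(500):  # episodes
    state, done = 0, False
    while not done:
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        nxt, reward, done = step(state, action)
        # Update towards immediate reward plus discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

print([["left", "right"][greedy(s)] for s in range(N_STATES - 1)])
```

After training, the greedy policy moves right in every non-terminal state, because the discount factor makes states closer to the reward more valuable.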

## How to Pick a Machine Learning Model

• If pre-existing models or frameworks exist for similar projects, then start with the same models.
  • These models have been experimented with and shown to work for similar problems.
  • At the very least, they can serve as baseline models if they do not work as well as intended.
• If there is data on expected outcomes, then use supervised machine learning methods.
• If there is no data on expected outcomes, then use (1) non-machine learning algorithms, (2) pre-trained models or (3) unsupervised machine learning methods.
• If the expected output is continuous, then use a regression approach.
• If the expected output is categorical, then use a classification approach.
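The rules of thumb above can be condensed into a small helper; this is purely illustrative, with the inputs simplified to two flags:

```python
def pick_approach(has_labels, output_type=None):
    """Suggest a modelling approach from the rules of thumb above.

    has_labels: whether data on expected outcomes (labels) exists.
    output_type: "continuous" or "categorical" (only used when labelled).
    """
    if not has_labels:
        return "non-ML algorithms, pre-trained models, or unsupervised learning"
    if output_type == "continuous":
        return "supervised learning: regression"
    if output_type == "categorical":
        return "supervised learning: classification"
    return "supervised learning"

print(pick_approach(True, "continuous"))
print(pick_approach(False))
```

In practice the choice also depends on data size and interpretability needs, which this sketch deliberately ignores.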