Data Science Overview

by Chee Yee Lim


Posted on 2021-03-10



Collection of notes on data science - a high-level overview of general data science concepts.


Major Classes of Machine Learning Techniques

Supervised Learning

  • Supervised learning makes predictions using a training dataset with known, labelled outcomes.
  • Supervised learning consists of 3 major steps: training, testing and prediction.
  • It can be categorized into classification or regression.
  • Typical workflow of supervised learning
```mermaid
graph LR
    subgraph Prediction
        I[New Data] --> J[Data Preprocessing]
        J --> K[Trained Model]
        K --> L[Generate Predictions]
        L --> M[Validation]
    end
    subgraph Training & Validation
        A[Historical Data] --> B[Data Preprocessing]
        B --> C[Random Sampling]
        C -->|70%| D[Train Data]
        C -->|30%| E[Test Data]
        D --> F[Train Model]
        F --> G[Generate Predictions]
        E --> G
        G --> H[Validation]
    end
```
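As an illustration, the training/testing/prediction workflow above can be sketched with plain NumPy. The 70/30 split, the least-squares model and the RMSE validation metric are all illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Historical data: 100 samples with one feature and a noisy linear outcome
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# Random sampling: 70% train / 30% test
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:70], idx[70:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Train model: ordinary least squares with an intercept column
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Generate predictions on the held-out test data and validate with RMSE
y_pred = np.hstack([X_test, np.ones((len(X_test), 1))]) @ coef
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

# Prediction on genuinely new data reuses the same preprocessing and model
x_new = np.array([[5.0, 1.0]])  # feature value 5.0 plus the intercept column
y_new = x_new @ coef
```

The key point the diagram makes is that new data must pass through the same preprocessing as the training data before reaching the trained model.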
  • List of algorithms
    • Linear regression
      • Linear method.
      • Regression.
      • Works by modelling a linear relationship between the independent variables and a dependent variable.
    • Survival regression
      • Linear method.
      • Regression.
      • Used to predict the time when a specific event is going to happen.
      • The main difference from linear regression is its ability to handle censoring, a type of missing-data problem where the time to the event is not fully observed.
    • Generalised linear model (GLM)
      • Non-linear method.
      • Regression.
      • Uses a possibly non-linear link function to allow the modelling of non-Gaussian distributed outcomes.
    • Generalized additive model (GAM)
      • Non-linear method.
      • Regression.
      • Assumes the outcome can be modeled by a sum of arbitrary functions of each variable.
      • These nonlinear functions are called splines.
    • Logistic regression
      • Linear method.
      • Binary classification.
      • Works as a GLM with a logistic (sigmoid) link function, mapping a linear combination of features to a probability.
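A minimal sketch of the link-function idea; the weights and input below are hypothetical, not fitted on any real data:

```python
import numpy as np

def sigmoid(z):
    # Logistic link: maps any real-valued linear score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted weights for two features plus an intercept
w = np.array([1.5, -2.0])
b = 0.25

x = np.array([0.8, 0.3])
p = sigmoid(x @ w + b)   # probability of the positive class
label = int(p >= 0.5)    # threshold at 0.5 for binary classification
```

Fitting the weights is done by maximum likelihood rather than least squares, but the prediction step is exactly this score-then-squash mapping.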
    • K-nearest neighbours
      • Non-parametric method; decision boundaries are non-linear.
      • Multi-class classification and regression.
      • For classification, works by classifying a point based on majority vote of k-nearest neighbours.
      • For regression, works by taking the average outcome of the k-nearest neighbours.
      • Important parameters to be decided are the right k and the distance measure between data points.
      • Note that it is different from k-means clustering (unsupervised).
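Both modes can be sketched in a few lines of NumPy; the toy data and choice of k below are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, classify=True):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k nearest neighbours
    if classify:
        # Classification: majority vote among the k nearest labels
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Regression: average outcome of the k nearest neighbours
    return y_train[nearest].mean()

# Two well-separated clusters of training points
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9]])
y_cls = np.array([0, 0, 0, 1, 1])
y_reg = np.array([0.0, 0.0, 0.0, 10.0, 10.0])

cls_pred = knn_predict(X, y_cls, np.array([0.05, 0.05]), k=3)                  # → 0
reg_pred = knn_predict(X, y_reg, np.array([0.05, 0.05]), k=3, classify=False)  # → 0.0
```

Swapping `np.linalg.norm` for another metric is how the choice of distance measure enters the algorithm.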
    • Naïve Bayes
      • Linear method.
      • Multi-class classification.
      • Works by finding the probability that a point belongs to a class for each feature separately.
      • It naively assumes that the features in a dataset are independent (i.e. conditional independence of features), ignoring any possible correlations between features.
    • Support vector machine
      • Linear or non-linear method, depending on the kernel used.
      • Binary classification.
      • Works by finding the optimal hyperplane that maximises the margin between different classes.
      • The data points closest to the classification boundary are known as support vectors.
      • Computational complexity scales roughly quadratically with the number of samples, so it is not practical beyond tens of thousands of samples.
    • Decision trees
      • Non-linear method.
      • Multi-class classification and regression.
      • Works by learning the best decision rules to separate data based on the variables.
      • Can work with both continuous and categorical data directly.
      • Tree structure is ideal for capturing interactions between features and non-linearity in data.
      • There is no need to transform features, since tree splits are invariant to any monotonic transformation of a feature.
      • Categorical data should not be one-hot encoded, as each additional dummy variable requires a deeper tree and worsens model performance.
      • Tends to overfit.
    • Random forest
      • Non-linear method.
      • Multi-class classification and regression.
      • Works by using an ensemble (bagging) of decision trees trained in parallel and averaging their final predictions.
    • Gradient-boosted trees
      • Non-linear method.
      • Multi-class classification and regression.
      • Works by using an ensemble (boosting) of decision trees trained sequentially, where each tree learns from the residuals of the preceding trees to produce more accurate results.
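The residual-fitting idea can be sketched with one-split regression stumps as the weak learners. The learning rate, round count and stump learner below are illustrative choices, not any specific library's implementation:

```python
import numpy as np

def fit_stump(x, r):
    # Weak learner: a one-split regression stump minimising squared error on r
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred_l, pred_r = left.mean(), right.mean()
        err = ((left - pred_l) ** 2).sum() + ((right - pred_r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, pred_l, pred_r)
    _, t, pl, pr = best
    return lambda z: np.where(z <= t, pl, pr)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.sin(x)

# Boosting: each stump is fit to the residuals left by the ensemble so far
pred = np.zeros_like(y)
lr = 0.5
for _ in range(50):
    residuals = y - pred          # what the current ensemble still gets wrong
    stump = fit_stump(x, residuals)
    pred += lr * stump(x)         # shrink each stump's contribution

mse = np.mean((y - pred) ** 2)
```

Each round shrinks the training error, which is also why boosted ensembles can overfit if the number of rounds is not controlled.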
    • XGBoost
      • Non-linear method.
      • Multi-class classification and regression.
      • One of the dominant ML techniques for a long time.
      • Better computation speed and support for out-of-core (larger-than-memory) computation.
    • LightGBM
      • Non-linear method.
      • Multi-class classification and regression.
      • A newer challenger to XGBoost with better performance.
      • LightGBM's performance gains come from growing trees leaf-wise rather than level-wise.

Unsupervised Learning

  • Unsupervised learning finds hidden patterns and structure in the dataset without the aid of labeled responses.
  • Unsupervised learning is ideal when you only have access to input data, and labelled training data is unavailable or hard to obtain.
  • It can be categorized into clustering and dimension reduction.
    • Assuming classes are unknown, clustering groups unlabeled observations that share similar properties.
    • Dimension reduction simplifies data with high dimensions by mapping them to a lower dimensional space. This also means it reduces the number of variables by finding a subspace that preserves the most information in the current high dimensional space.
  • List of use cases for unsupervised learning
    • Anomaly detection
      • Anomaly detection identifies rare observations that deviate significantly from, and stand out against, the majority of the data.
      • E.g. algorithms are isolation forest (based on random forest), one-class SVM (data points not belonging to the one class are outliers).
    • Topic modelling
      • In NLP, topic modelling is a form of dimension reduction aiming to find out the topics in a group of documents.
      • E.g. algorithm is LDA (Latent Dirichlet Allocation).
    • Recommendations
      • Recommendation algorithms provide personalised recommendation based on customer behavior.
      • E.g. algorithms are collaborative filtering, content-based filtering and association rule learning (for market basket analysis).
  • List of clustering algorithms
    • k-means clustering
      • Linear method. (Can be kernelised for non-linear cluster boundaries.)
      • Works by iteratively minimising the distance of each data point to the nearest assigned centroid.
      • Different distance metrics can be used, such as Euclidean and correlation.
      • The downside is that each data point can only belong to one cluster, and solutions may get stuck in a local optimum due to random initialization.
      • k-means uses the expectation-maximisation approach to cluster points.
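The expectation-maximisation loop can be sketched as follows; the cluster count, synthetic data and iteration budget are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialisation: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(42)
# Two well-separated Gaussian blobs around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

The dependence on random initialisation is visible in the first line of the function: a different seed can land the loop in a different local optimum, which is why k-means is usually run several times.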
  • List of dimension reduction algorithms
    • PCA (Principal Component Analysis)
      • Works by finding a set of linearly uncorrelated features on a low-dimensional subspace while preserving most of the variance in the data.
    • LDA (Linear Discriminant Analysis)
      • Finds the linear projection that best separates known classes; unlike PCA, it is supervised.
    • SVD (Singular Value Decomposition)
      • Matrix factorisation whose truncation gives the best low-rank approximation of the data; PCA can be computed via the SVD of the centred data matrix.
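PCA's connection to SVD can be sketched directly: centre the data, take the SVD, and keep the top right singular vectors as the principal axes. The synthetic 3-D data below is an assumption for illustration:

```python
import numpy as np

def pca(X, n_components):
    # Centre the data; the principal axes are the right singular vectors
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # directions of maximal variance
    explained_var = (S ** 2) / (len(X) - 1)    # variance captured by each axis
    return Xc @ components.T, components, explained_var[:n_components]

rng = np.random.default_rng(0)
# 3-D data that mostly varies along a single hidden direction
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.1, 200),
                     rng.normal(0, 0.1, 200)])
Z, comps, var = pca(X, n_components=1)  # project onto the top principal axis
```

Here one component captures almost all of the variance, which is exactly the "subspace that preserves the most information" described above.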

Reinforcement Learning

  • The objective of reinforcement learning is to map situations to actions that yield the maximum final reward.
    • While choosing an action, the algorithm should consider not just the immediate reward but also the next and all subsequent rewards.
  • List of algorithms
    • Markov decision process
    • Q-learning
    • Temporal difference methods
    • Monte-Carlo methods
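A minimal tabular Q-learning sketch on a toy five-state corridor; the environment, learning rate, discount factor and exploration schedule are all illustrative assumptions:

```python
import numpy as np

# Toy environment: states 0..4 on a line, actions 0 = left / 1 = right,
# reward 1 for reaching the terminal state 4, 0 otherwise.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for _ in range(500):
    s = 0
    while s != 4:
        # Epsilon-greedy action selection (random tie-break while Q is flat)
        if rng.random() < eps or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: bootstrap from the best action in the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

greedy_policy = Q.argmax(axis=1)  # the learned policy moves right everywhere
```

The discount factor `gamma` is what makes the agent weigh subsequent rewards, not just the immediate one: the learned value of moving right from state 0 is roughly `gamma ** 3`, the discounted value of the reward three steps away.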

How to Pick a Machine Learning Model

  • If pre-existing models or frameworks exist for similar projects, then start with the same models.
    • These models have been experimented with and shown to work for similar problems.
    • At the very least, they can serve as a baseline model if they do not work as well as intended.
  • If there are data on expected outcomes, then use supervised machine learning methods.
    • If there is no data on expected outcomes, then use (1) non-machine learning algorithms, (2) pre-trained models or (3) unsupervised machine learning methods.
  • If the expected output is continuous, then use a regression approach.
    • If the expected output is categorical, then use a classification approach.