Chee Yee Lim

Data Scientist in Singapore

"The most important and difficult aspects of data science often lie not in the machine learning techniques, but in defining clear business problems and engaging stakeholders throughout the process."

Full stack data scientist with a focus on solving actual business problems using the most effective machine learning techniques.

Experienced in the full data product lifecycle from data collection, through processing and analysis, to production deployment and communication of results to clients.


Portfolio

A selection of dashboards built using Python Dash and public data, with a sprinkle of machine learning.

Singapore Resale HDB Analytics

Dashboard to understand the pricing history of HDB resale flats.

Open Dashboard

NLP-powered Text Analyser

Tool to analyse and extract information from any input text.

Open Tool

Optimisation Visualiser

Dashboard to visualise how algorithms search across an optimisation surface.

Open Dashboard

Articles

A selection of articles on all things data science, written or compiled by me.

Blog Posts

Blog posts to share my data science/machine learning journey

Read More

Code Templates

Code templates that I have written to handle common data science/machine learning tasks

Read More

Note Collections

Collections of notes that I have compiled on data science/machine learning

Read More

Experience

Senior Data Scientist

DHL Consulting, Singapore
  • Drive data analytics projects for both internal and external clients.
Sept 2020 - Present

Data Scientist

PatSnap, Singapore
  • Developed a patent valuation model by integrating novel NLP-based metrics extracted from patent texts with traditional patent indicators for patent value prediction.
    • Deployed the random forest-based model into production in 2 forms: (1) as a dockerised Flask API model for generating real-time predictions on new data, and (2) as a PySpark pipeline for generating batch predictions on historical data.
    • Worked closely with a team of 3 product managers and 2 engineers to ensure the product was developed on time while achieving business goals (i.e. user-requested features) and fitting into existing IT infrastructure (i.e. data ETL pipelines).
    • Crafted a go-to-market strategy with product managers and marketing specialists by performing market analysis against 3 competitors to effectively market our new patent valuation capability to our prospective clients.
  • Mentored a junior team member by listening to his needs and providing technical assistance so that he could complete his project on time.
July 2019 - Sept 2020

Data Scientist

Schroders, UK
  • Led the development of the human capital data product to provide summary insights into the board director relationships and career histories for 20,000+ public companies globally.
    • Engineered the backend ETL using distributed processing (Scala + Spark + SQL) and the frontend using R Shiny dashboard (Rmarkdown + plotly).
    • The final product was described as ‘a distinct value-add and massive time saver’ by the heads of 3 investment research teams, who asked their analysts to use the product as part of their investment process.
  • Developed a market condition model to predict the medium-term drawdown risk of an equity index to help establish the first quantamental investment approach for the equity team based in Hong Kong.
    • Constructed the model using time series logistic regression, which is an interpretable ML approach that facilitates understanding and therefore adoption by stakeholders.
    • Presented results to stakeholders, who decided to regularly consider the model output as part of their monthly investment research meeting.
  • Liaised with 9 data vendors and verified the quality of alternative data by checking their data collection and processing methodology, as well as comparing the data with known information.
September 2017 - May 2019

PhD Researcher

University of Cambridge, UK
  • Used machine learning techniques to study how stem cells make developmental decisions by analysing terabytes of time-series single-cell expression data.
    • Reconstructed the development timeline with a polynomial model fitted to a kernel PCA-reduced space, which enables the subsequent inference of potential causal relationships among genes using penalised vector autoregression and Boolean models.
    • Initiated collaborations with 4 other research groups, including a lab in Microsoft Research Cambridge, to bring together experts in statistics, computer science and biology.
    • The work resulted in 2 research papers, one of which I first-authored.
  • Commended by at least 3 senior researchers for my ability to present complex technical concepts clearly and passionately.
October 2013 - September 2017

Education

University of Cambridge, UK

PhD in Computational Biology
  • Graduated on time with 2 research papers and presented a poster at the ISMB conference.
  • Tutored for 2 bioinformatics courses and led the Wolfson College Table Tennis team.
October 2013 - September 2017

University of Edinburgh, UK

BSc (Hons) in Genetics
  • Achieved first-class honours despite skipping the first year of study via direct entry to the second year.
  • Represented 30 students as a class representative and voiced concerns affecting students.
September 2010 - June 2013

Skills

Programming Languages & Tools
  • Python
  • R
  • SQL
  • HTML
  • Apache Spark
  • Apache Solr
  • PyTorch
  • Git
  • Docker
  • Cloud Computing
Machine Learning & Statistical Methods
  • Exploratory Data Analysis
  • Time-series Data Analysis
  • Natural Language Processing
  • Graph Analysis
Human Languages
  • English - Fluent
  • Mandarin - Native
  • Malay - Intermediate
  • Cantonese - Conversational

Achievements

Certifications
  • Google Cloud Training (View)
  • Passed CFA level 1
  • Investment Management Certificate
Publications
  • Distinct molecular trajectories converge to induce naive pluripotency. Cell Stem Cell, September 2019. (View)
  • Understanding transcriptional regulation through computational analysis of single-cell transcriptomics. Doctoral thesis, September 2017. (View)
  • BTR: training asynchronous Boolean models using single-cell expression data. BMC Bioinformatics, September 2016. (View)

Interests

  • Full stack development for data science/machine learning

    I enjoy this even in my free time. Deploying data-related products gives me a weird sense of achievement, although I admit that I enjoy backend engineering more than frontend engineering.

  • Personal investment

    Investment is a field where skill can be easily quantified, but fools can achieve great returns by just being lucky. The chaotic nature of the investment market is remarkably similar to a biological system, where a collection of simple rules leads to a very complex system.

  • Table tennis

    The only sport I am good at - the sport where I understand the theory, but lack the physical prowess to execute it.

  • Travel

    It may seem paradoxical, but I enjoy both spending quiet time in nature and sitting in a cafe beside a busy street corner in a foreign country. Both activities allow me to observe others and reflect on myself.