Data Science Project Management

by Chee Yee Lim


Posted on 2021-04-22



A collection of notes on data science project management: everything required to run a data science project successfully. This includes project workflow and stakeholder management.


Data Science Project Workflow

Typical Proof-of-concept Data Science Projects

Workflow overview: Understand Business → Identify KPI → Understand Current Solution → Locate Data → Decide Model → Prepare Output → Provide Recommendation → Conclude on Financial Impact
  1. Understand and summarise business issue
    1. Ask clarifying questions to understand background
    2. Understand key business objectives
  2. Identify business KPIs to measure impact of solution
    1. More focus on business KPIs connected to business impact
    2. Less focus on machine learning KPIs such as precision/recall.
  3. Understand existing processes
    1. How is the problem being solved currently?
    2. What are existing processes around business activities/data collection?
  4. Identify data source
    1. What data is available (internally or externally)?
    2. Focus on most important data (highest ROI).
  5. Perform exploratory data analysis
    1. Check quality of data (e.g. missing data, imbalanced data)
    2. Check if trends in data agree with expectations
  6. Understand requirements for the model
    1. ML metrics. Precision or recall?
    2. Should model be interpretable?
  7. Plan modelling approach
    1. Data processing
    2. Feature selection
    3. Model algorithms & assumptions/limitations
    4. Objective functions/Evaluation metrics
  8. Prepare outputs
    1. What are the outputs and how are they consumed (dynamic dashboard, static report)?
  9. Provide recommended actions/solutions/outputs
    1. Recommended actions/solutions should mix business and data science.
    2. Explain model results relative to business goals
  10. Conclude about implementation and impact
    1. Estimate financial impact from the model
    2. Be sure the action answers the business goal
    3. Explain how to implement model into production
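Steps 2, 6 and 10 above connect machine learning metrics to business impact. As a minimal sketch of that link, the snippet below computes precision and recall from confusion counts and then converts the same counts into a rough monetary estimate. The counts, the per-case values and costs, and all names (`ConfusionCounts`, `estimated_net_value`) are hypothetical and chosen only for illustration; a real project would take them from the business KPIs agreed with stakeholders.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from comparing model predictions with true labels (illustrative)."""
    true_positive: int
    false_positive: int
    false_negative: int
    true_negative: int


def precision(c: ConfusionCounts) -> float:
    # Share of predicted positives that are truly positive.
    predicted_positive = c.true_positive + c.false_positive
    return c.true_positive / predicted_positive if predicted_positive else 0.0


def recall(c: ConfusionCounts) -> float:
    # Share of actual positives that the model found.
    actual_positive = c.true_positive + c.false_negative
    return c.true_positive / actual_positive if actual_positive else 0.0


def estimated_net_value(
    c: ConfusionCounts,
    value_per_true_positive: float,
    cost_per_false_positive: float,
    cost_per_false_negative: float,
) -> float:
    # Assumed business model: each correct alert earns a fixed value,
    # each false alert and each missed case carries a fixed cost.
    return (
        c.true_positive * value_per_true_positive
        - c.false_positive * cost_per_false_positive
        - c.false_negative * cost_per_false_negative
    )


# Hypothetical evaluation results on 1,000 cases.
counts = ConfusionCounts(
    true_positive=80, false_positive=20, false_negative=40, true_negative=860
)

print(f"precision = {precision(counts):.2f}")
print(f"recall    = {recall(counts):.2f}")
print(f"net value = {estimated_net_value(counts, 100.0, 10.0, 50.0):.0f}")
```

Whether precision or recall matters more follows from the cost figures: if a missed case is much more expensive than a false alert, recall deserves more weight, and vice versa.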

Data Science Project Management Frameworks

  • Standard project management approaches usually do not fit data science projects very well.
    • This is because data science projects typically involve a long exploration phase, with many unknowns remaining even quite late into a project.
    • This differs from traditional software development, where it is usually possible to estimate the resources needed and define clear outcomes before a project even starts.

Waterfall

  • Overview
    • The Waterfall approach is the most traditional method for managing a project, with team members working linearly towards a set end goal.
    • Each participant has a clearly defined role, and none of the phases or goals are expected to change. Changes are often discouraged and costly.
    • It works best for projects with long, detailed plans that require a single timeline.
    • Waterfall workflow is typically visualised using Gantt charts.
  • Typical stages
    1. Requirements phase - Gathers and analyzes all the requirements and documentation for the project.
    2. Design phase - Designs the project's workflow model.
    3. Implementation phase - Begins working on the project's implementations.
    4. Testing phase - Tests each implementation to ensure it fulfills the requirements.
    5. Deployment phase - Deploys or delivers the product.
    6. Maintenance phase - Maintains the product.
  • Pros
    • Simpler to budget resources (money, time and labour) that go into the project.
    • Team members can concentrate more on working on the product.
    • Ease of replication for future similar tasks.
  • Cons
    • The strategy is relatively inflexible since each step is preplanned in a linear sequence.

Agile

  • Overview
    • Agile project management is an iterative approach to delivering a project throughout its life cycle.
      • Iterative or agile life cycles are composed of several iterations or incremental steps towards the completion of a project.
    • One of the aims of an agile approach is to release benefits throughout the process rather than only at the end.
    • Teams work in short cycles, called sprints, to provide continuous improvements.
    • Agile is a framework and a working mind-set which helps respond to changing requirements.
    • An early form of this methodology is Scrum, which builds on the concept of requirements volatility.
      • This principle acknowledges the reality that customers might end up having different needs or expectations for the software than originally intended.
    • Workflow in an Agile framework is usually managed using Kanban boards, where each task is represented as a card that moves through the project stages.
  • Major principles
    • Should exhibit central values of trust, flexibility, empowerment and collaboration.
    • Breaks down project into smaller pieces, which are then prioritised by the team in terms of importance.
    • Promotes collaborative working, especially with stakeholders/customers.
    • Focus on individuals and interactions rather than process and tools, to promote an environment that encourages consensus.
    • Responds to change over following a structured plan.
    • Reflects, learns and adjusts at regular intervals to ensure that the customer is always satisfied and is provided with outcomes that result in benefits.
    • Integrates planning with execution, allowing an organisation to create a working mindset that helps a team respond effectively to changing requirements.
  • Pros
    • Leads to incremental improvement in products and a greater focus on delivery cycle.
    • Helps build client and user engagement because changes are incremental and evolutionary rather than revolutionary - it can therefore be effective in supporting cultural change that is critical to the success of most transformation projects.
    • Allows ideas to be tested and rejected early with tight feedback loops.
    • Encourages innovations.
  • Cons
    • Rapid changes and more features may lead to more bugs.

CRISP-DM

  • Overview
    • CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
    • CRISP-DM provides a structured approach to planning a data mining project.
  • Typical stages
    1. Business understanding
      • Understand what needs to be accomplished from a business perspective.
      • The goal of this stage is to uncover important factors that could influence the outcome of the project.
      • Key tasks in this stage include:
        • Define business and data mining success criteria.
        • Assess current situation in terms of resources, constraints, risks and costs.
        • Produce project plan on key stages and tasks.
    2. Data understanding
      • Acquire and understand the data required for the project.
      • Key tasks in this stage include:
        • List data sources together with information on how to collect them.
        • Describe data properties and evaluate if they satisfy project requirements.
        • Explore data to get results on first findings or initial hypothesis.
        • Verify data quality.
    3. Data preparation
      • Key tasks in this stage include:
        • Select data to use for analysis based on project goals, data quality etc.
        • Clean data (e.g. handle missing data).
        • Construct data (i.e. feature engineering).
        • Integrate data (i.e. merge multiple data sources, calculate aggregations).
    4. Modelling
      • Key tasks in this stage include:
        • Select modelling technique to use.
        • Generate test design (i.e. how to test and evaluate models).
        • Build models and document results.
        • Assess model performance.
        • Iteratively improve model (e.g. changing parameters).
    5. Evaluation
      • This evaluation step focuses on whether the model meets the business objectives, whereas earlier evaluation steps focus on technical aspects of the model (e.g. accuracy, generality).
      • Key tasks in this stage include:
        • Assess if model meets business objectives.
        • Review process and highlight activities that have been done well or can be improved.
        • Determine next steps (e.g. whether to move to deployment or initiate further iterations).
    6. Deployment
      • Key tasks in this stage include:
        • Plan deployment (e.g. determine strategy and ways of deployment).
        • Plan monitoring and maintenance of the model in production environment.
        • Produce final report and presentation.
        • Review project (e.g. summarise important experience gained during the project).
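The data understanding and data preparation stages above (verify data quality, clean, construct, integrate) can be sketched with a few lines of plain Python. This is a minimal illustration under stated assumptions, not a recommended pipeline: the two tiny datasets, the column names, the median imputation, the age threshold of 40 and every function name are hypothetical.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
customers = [
    {"customer_id": 1, "age": 34, "churned": 0},
    {"customer_id": 2, "age": None, "churned": 0},
    {"customer_id": 3, "age": 51, "churned": 1},
    {"customer_id": 4, "age": 28, "churned": 0},
]
# Hypothetical second data source: one row per order.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 10.0},
]


def missing_rate(rows, column):
    # Verify data quality: fraction of rows with no value in the column.
    return sum(row[column] is None for row in rows) / len(rows)


def class_share(rows, column, value):
    # Verify data quality: fraction of rows in a given class (imbalance check).
    return sum(row[column] == value for row in rows) / len(rows)


def impute_median(rows, column):
    # Clean data: replace missing values with the median of observed values.
    fill = median(row[column] for row in rows if row[column] is not None)
    return [{**row, column: fill if row[column] is None else row[column]} for row in rows]


def add_age_band(rows):
    # Construct data: derive a simple categorical feature from age.
    return [{**row, "age_band": "40_plus" if row["age"] >= 40 else "under_40"} for row in rows]


def add_total_spend(rows, order_rows):
    # Integrate data: aggregate orders per customer, then merge onto customers.
    totals = {}
    for order in order_rows:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [{**row, "total_spend": totals.get(row["customer_id"], 0.0)} for row in rows]


print("missing age rate:", missing_rate(customers, "age"))
print("churned share:", class_share(customers, "churned", 1))

prepared = add_total_spend(add_age_band(impute_median(customers, "age")), orders)
for row in prepared:
    print(row)
```

Each function returns new rows instead of modifying its input, so the raw data stays available for comparison if a preparation step needs to be revisited in a later iteration.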

Sources