
Engineering Data Science Systems

What is Systems Thinking? 🤖


In a broader sense, systems thinking can be defined as a way of thinking about systems that is global 🌍 and encompassing rather than focused on a particular issue.

The idea is to understand the larger context first and then make design choices for what we are building.

Even if we are building a small part of a large system, we should keep the larger goal in mind. Doing so is systems thinking.


Let's take an example of engineering a Data Science System.

Question

What do you think would be the components of a Data Science System?

The first answer that comes to mind is Data. But what else? Data Science System = Data + ___ + ___ + ___ ...

One way to fill the gaps is to understand the process involved in engineering a Data Science System.

Data Science Process

Thinking about it, an easy guess would be domain knowledge along with maths, stats and engineering skills:

🟢 Domain Knowledge
🟢 Math and Stats
🟢 Hacking Skills (Engineering Skills)

Another way to show this is:

🟠 Business
🟠 Programming
🟠 Statistics
🟠 Communication

Based on these skills and how they are used, different roles are defined:

Data Scientist 🥼: Generally someone who understands all of the above aspects (Business + Programming + Statistics + Communication).

Data Analyst 👩🏽‍💼: Someone who understands the business and communicates well, with some knowledge of programming/statistics.

Research Engineer 📊: Strong in Programming + Statistics + Communication, but might not have the business understanding.

Data Engineer 👨🏽‍💻: Strong in Business + Programming (integrates different IT infrastructure, such as data sources and systems, to perform the task).

Now let's shift gears and look at one of the popular processes used for engineering a Data Science System.

CRISP-DM

Engineering systems for Data Science involves two components:

Process 🧾 + Programming 👨🏽‍💻

Question

What, then, is a process when building a system?

A process can be thought of as Flow of Steps + Agile improvements (start with small components and improve them gradually).

One example of such a process for engineering Data Science systems is CRISP-DM.


graph TD
    Business_Understanding --> Data_Understanding
    Data_Understanding --> Business_Understanding
    Data_Understanding --> Data_Preparation
    Data_Preparation --> Modelling
    Modelling --> Data_Preparation
    Modelling --> Evaluation
    Evaluation --> Deployment
    Evaluation --> Business_Understanding


Business Understanding 👩🏽‍💼

Data Understanding 🧹

Data Preparation 🕋

Modelling 📊

Evaluation 📋

Deployment 🚀

The entire process is iterative and continuous: any stage can send us back to an earlier one.
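As a rough sketch, the CRISP-DM diagram can be written down as a small Python dictionary, which makes the back-edges (the iterative part) easy to see. Only the edges come from the diagram; the names `CRISP_DM`, `next_stages` and `is_valid_path` are made up for illustration.

```python
# Edges copied from the CRISP-DM diagram; the dict layout and the helper
# names below are illustrative assumptions, not part of CRISP-DM itself.
CRISP_DM = {
    "Business_Understanding": ["Data_Understanding"],
    "Data_Understanding": ["Business_Understanding", "Data_Preparation"],
    "Data_Preparation": ["Modelling"],
    "Modelling": ["Data_Preparation", "Evaluation"],
    "Evaluation": ["Deployment", "Business_Understanding"],
    "Deployment": [],
}


def next_stages(stage):
    """Return the stages that can follow `stage` in the diagram."""
    return CRISP_DM[stage]


def is_valid_path(path):
    """Check that every consecutive pair of stages is an edge in the diagram."""
    return all(b in CRISP_DM[a] for a, b in zip(path, path[1:]))


# One possible walk: modelling sends us back to data preparation once.
walk = [
    "Business_Understanding", "Data_Understanding", "Data_Preparation",
    "Modelling", "Data_Preparation", "Modelling", "Evaluation", "Deployment",
]
print(is_valid_path(walk))
print(next_stages("Evaluation"))
```

Skipping a stage, for example going from Data_Preparation straight to Evaluation, is rejected by `is_valid_path` because that edge is not in the diagram.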

Let's zoom into each component of the process and see what questions we can ask to better understand the requirements.





CRISP-DM: Business Understanding

Business Understanding

  • What are the business objectives?

  • Can data science achieve those objectives? (Most of the time it just reveals the existing system's shortcomings.)

  • How do we define success metrics?

  • Are there ethical considerations in data usage?

  • What have other industries achieved? (What is the state of the art (SOTA)?)

CRISP-DM: Data Understanding, Preparation & Modelling

Data Understanding:

  • What are the sources of data?

  • Does new data need to be collected? (slow and expensive)

  • What is the quality and quantity of the data?

  • What do different data items represent?

  • Which data is relevant to the objective?

Agile steps for Data Preparation:

  • What are the different data formats?

  • Is there a need to annotate the data?

  • How can the data be extracted, transformed and loaded (ETL)?

  • How should the data be standardised (rescaled to zero mean and unit variance) and normalised (rescaled to lie between 0 and 1, with the minimum mapped to 0 and the maximum to 1)?

  • How to efficiently store data for analysis?
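To make the standardise vs normalise question concrete, here is a minimal sketch using only the Python standard library. The function names and the sample `ages` values are made up for illustration, and using the population standard deviation is an assumption (libraries differ on this).

```python
from statistics import fmean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        raise ValueError("cannot standardise constant data")
    return [(v - mean) / std for v in values]


def normalise(values):
    """Min-max scale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot normalise constant data")
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 25, 32, 47, 60]  # made-up sample values
z = standardise(ages)
scaled = normalise(ages)
print([round(v, 3) for v in z])
print([round(v, 3) for v in scaled])
```

Standardising keeps the shape of the distribution but removes the units, while min-max normalising pins the range and is therefore more sensitive to outliers.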

Data Modelling

  • What assumptions should be made for the models?

  • Statistical or algorithmic modelling? (Statistical modelling uses a small set of simple models but gives strong, concrete statistical evidence for what we find. Algorithmic modelling throws a bunch of complex models at the problem and lets the machine find the best one.)

  • Is the clean data sufficient for modelling?

  • Is the compute budget sufficient for modelling?

  • Are the results statistically significant?
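One way (among many) to approach the significance question is a permutation test on the scores of two models. This is only a sketch of that one technique, not a prescribed part of CRISP-DM; the function name, the scores and the 0.05 threshold are all assumptions for illustration.

```python
import random
from statistics import fmean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the fraction of random relabellings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up cross-validation accuracies for two hypothetical models.
model_a = [0.91, 0.93, 0.92, 0.94, 0.95]
model_b = [0.61, 0.63, 0.60, 0.62, 0.64]
p_value = permutation_test(model_a, model_b)
print(f"p-value: {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

A small p-value says the gap between the two sets of scores is unlikely to be a labelling accident; it does not say the gap matters for the business objective.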

CRISP-DM: Evaluation & Deployment

Evaluation:

Once the data scientist has finished modelling and shown that the results are statistically significant, the model is passed on to evaluation.

  • Does the model work correctly with test data?

  • Does the model achieve the business objective?

  • Does the model meet performance requirements?

  • Is the model unbiased and robust?

  • What are the ways to improve the model?
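The first two evaluation questions can be sketched as a check of held-out predictions against a success metric agreed during business understanding. Accuracy is used here only as an example metric; the labels, predictions and the 0.7 target are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(y_true, y_pred, min_accuracy):
    """Compare test accuracy with the target set by the business objective."""
    acc = accuracy(y_true, y_pred)
    return {"accuracy": acc, "meets_target": acc >= min_accuracy}


# Made-up held-out test labels and model predictions.
test_labels = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
report = evaluate(test_labels, predictions, min_accuracy=0.7)
print(report)
```

If `meets_target` comes back false, the CRISP-DM loop sends us back to business understanding or modelling rather than forward to deployment.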

Deployment:

  • Where are the models to be deployed? (mobile, servers, drones, etc.)

  • What is the H/W and S/W stack for deployment?

  • Does it meet performance requirements? (battery limitations, latency, etc.)

  • Does it violate privacy requirements?

  • Does it meet users' expectations?
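For the latency part of the performance question, a rough sketch is to time repeated calls and compare a high percentile against a budget. `fake_predict` is a hypothetical stand-in for a real model, and the 50 ms budget is an assumed requirement; real numbers should be measured on the target hardware.

```python
import math
import time


def measure_latency_ms(predict, sample, n_runs=100):
    """Call `predict(sample)` n_runs times and return latencies in ms."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, q):
    """Nearest-rank percentile of `values` for 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fake_predict(x):
    # Hypothetical stand-in for a real model's predict call.
    return sum(v * v for v in x)


LATENCY_BUDGET_MS = 50.0  # assumed requirement, set per project

latencies = measure_latency_ms(fake_predict, [0.1] * 1000)
p95 = percentile(latencies, 95)
print(f"p95 latency: {p95:.3f} ms, within budget: {p95 <= LATENCY_BUDGET_MS}")
```

Looking at a percentile rather than the mean is a design choice: users notice the slow calls, not the average one.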

This whole process needs to be done iteratively, along with the following:

  1. Iteratively design and deploy -> MVP
  2. Revise expectations of success and the value of data science
  3. Upgrade human and hardware resources

Compared with an ad hoc approach, this process is a much better way to know where the project stands and what the next steps are.