Engineering Data Science Systems

What is Systems Thinking? 🤖

In a broader sense, systems thinking can be defined as a way of thinking about systems that is global 🌍 and encompassing rather than focused on one particular issue.

The idea is to understand the larger context first and then make design choices for what we are building.

Even if we are building a small part of a large system, we should keep the larger goal in mind. Doing so is systems thinking.

Let’s take the example of engineering a Data Science System.

Question

What do you think the components of a Data Science System would be?

The first answer that comes to mind is data. But what else? Data Science System = Data + ___ + ___ + ___ + ...

One way to fill the gaps is to understand the process involved in engineering a Data Science System.

Data Science Process

Thinking about it, an easy guess would be domain knowledge along with maths, stats and engineering skills:


🟢 Domain Knowledge
🟢 Math and Stats
🟢 Hacking Skills (Engineering Skills)

Another way to show this is:


🟠 Business
🟠 Programming
🟠 Statistics
🟠 Communication

Different roles are defined based on these skills and how much of each is used:

Data Scientist 🥼 : Generally someone who understands all of the above aspects (business + programming + statistics + communication)

Data Analyst 👩🏽‍💼 : Someone who understands the business and communicates well, with some knowledge of programming and statistics

Research Engineer 📊 : Strong programming + statistics + communication, but might not have the business understanding

Data Engineer 👨🏽‍💻 : Strong business + programming (integrates different IT infrastructure, such as data sources and systems, to perform the task)

Now let’s shift gears and look at one of the popular processes used for engineering a Data Science System.

CRISP - DM

Engineering Systems for Data Science involves two components :

Process 🧾 + Programming 👨🏽‍💻

Question

What, then, is a process when building a system?

A process can be thought of as a flow of steps + agile improvements (start with small components and improve them gradually).

One example of such a process for engineering data science systems is CRISP-DM (Cross-Industry Standard Process for Data Mining).

graph TD
    Business_Understanding --> Data_Understanding
    Data_Understanding --> Business_Understanding
    Data_Understanding --> Data_Preparation
    Data_Preparation --> Modelling
    Modelling --> Data_Preparation
    Modelling --> Evaluation
    Evaluation --> Deployment
    Evaluation --> Business_Understanding

Business understanding 👩🏽‍💼

Data Understanding 🧹

Data Preparation 🕋

Modelling 📊

Evaluation 📋

Deployment 🚀

The entire process is iterative and continuous.
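The flow in the diagram above can also be written down as a small transition map. This is only a sketch: the phase names and arrows are copied from the diagram, while the helper names `next_phases` and `is_valid_path` are hypothetical and chosen for illustration.

```python
# Phase transitions copied from the CRISP-DM diagram above.
CRISP_DM = {
    "Business_Understanding": ["Data_Understanding"],
    "Data_Understanding": ["Business_Understanding", "Data_Preparation"],
    "Data_Preparation": ["Modelling"],
    "Modelling": ["Data_Preparation", "Evaluation"],
    "Evaluation": ["Deployment", "Business_Understanding"],
    "Deployment": [],
}


def next_phases(phase):
    """Return the phases that can follow the given phase (hypothetical helper)."""
    return CRISP_DM[phase]


def is_valid_path(path):
    """Check that every consecutive pair of phases is an arrow in the diagram."""
    return all(b in CRISP_DM[a] for a, b in zip(path, path[1:]))


# A project that loops back from Modelling to Data_Preparation once.
project = [
    "Business_Understanding",
    "Data_Understanding",
    "Data_Preparation",
    "Modelling",
    "Data_Preparation",
    "Modelling",
    "Evaluation",
    "Deployment",
]
print(is_valid_path(project))
```

Writing the arrows down this way makes the back edges explicit: Evaluation can send the project all the way back to Business_Understanding, which is what makes the process iterative.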

Let’s zoom into each component of the process and see what questions we can ask to better understand the requirements.

CRISP-DM_Business Understanding

Business Understanding

What are the business objectives?
Can data science achieve those objectives? (Most of the time it just reveals the existing system's shortcomings.)
How do we define success metrics?
Are there ethical considerations in data usage?
What have other industries achieved? (What is the state of the art (SOTA)?)

CRISP - DM_Data Understanding, Preparation & Modelling :

Data Understanding :

What are the sources of data?
Does new data need to be collected? (This is slow and expensive.)
What is the quality and quantity of the data?
What do different data items represent?
Which data is relevant to the objective?

Agile steps for Data Preparation :

What are the different data formats?
Is there a need to annotate the data?
How can the data be extracted, transformed and loaded (ETL)?
How do we standardise the data (rescale it to zero mean and unit variance) and normalise it (rescale it to lie between 0 and 1, where 0 and 1 are mapped to the min and max)?
How to efficiently store data for analysis?
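The standardise and normalise steps mentioned above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the `ages` sample are made up for the example, and the population standard deviation is an assumption (a sample standard deviation would be an equally reasonable choice).

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardise constant data")
    return [(v - mu) / sigma for v in values]


def normalise(values):
    """Rescale values to [0, 1], mapping the min to 0 and the max to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise constant data")
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 50, 60]  # made-up sample column
print(standardise(ages))
print(normalise(ages))
```

In a real pipeline the mean, standard deviation, min and max would usually be computed on the training data only and then reused for the test data, so that no information leaks from the test set.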

Data Modelling

What assumptions should we make for the models?
Statistical or algorithmic modelling? (Statistical: we use a small set of simple models but give strong, concrete statistical evidence for what we find. Algorithmic: we try a range of complex models and let the machine find the best one using algorithms.)
Is the clean data sufficient for modelling?
Is the compute budget sufficient for modelling?
Are the results statistically significant?
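One possible way to approach the significance question is a permutation test on the difference between two sets of scores. The sketch below is an illustration under stated assumptions, not a prescribed method: the function name, the number of permutations, and the accuracy values are all made up for the example.

```python
import random


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means of a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis they are exchangeable.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up accuracies of two models over eight evaluation runs each.
model_a = [0.91, 0.90, 0.92, 0.89, 0.93, 0.90, 0.91, 0.92]
model_b = [0.71, 0.70, 0.73, 0.69, 0.72, 0.70, 0.71, 0.74]
print(permutation_p_value(model_a, model_b))
```

A small p-value suggests the gap between the two models is unlikely to be due to chance alone. Whether that gap matters for the business objective is a separate question for the evaluation phase.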

CRISP - DM_Evaluation & Deployment :

Evaluation :

Once the data scientist has finished the modelling and shown that the results are statistically significant, the model is passed on to evaluation.

Does the model work correctly with test data?
Does the model achieve the business objective?
Does the model meet the performance requirements?
Is the model unbiased and robust?
What are the ways to improve the model?

Deployment :

Where is the model to be deployed? (Mobile, servers, drones, etc.)
What is the H/W and S/W stack for deployment?
Does it meet the performance requirements? (Battery limitations, latency, etc.)
Does it violate privacy requirements?
Does it meet the user’s expectations?
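The latency question above can be checked with a simple timing harness. This is a rough sketch, assuming the model is exposed as a Python callable: `toy_model`, the helper names and the 50 ms budget are hypothetical, and a real check would run on the target hardware with realistic inputs.

```python
import time
from statistics import median


def median_latency_ms(predict, inputs, repeats=5):
    """Median wall-clock time of predict(x) in milliseconds."""
    timings = []
    for x in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            predict(x)
            timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


def meets_latency_budget(predict, inputs, budget_ms):
    """True if the median latency stays within the given budget."""
    return median_latency_ms(predict, inputs) <= budget_ms


def toy_model(x):
    # Hypothetical stand-in for a real model's predict function.
    return 2 * x + 1


print(meets_latency_budget(toy_model, range(100), budget_ms=50.0))
```

The median is used here rather than the mean so that a few slow outlier calls do not dominate the result. If tail latency matters for the deployment, a high percentile would be a better measure.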

This whole process needs to be done iteratively, which means we should:

Iteratively design and deploy, starting from an MVP (minimum viable product)
Revise the expectations of success and the value of data science
Upgrade human and hardware resources

Compared with an ad hoc approach, following this process makes it much easier to know where the project stands and what the next steps are.