Engineering Data Science Systems
What is Systems Thinking?
In a broader sense, systems thinking can be defined as a way of thinking about systems that is global and encompassing rather than focused on one particular issue.
It means understanding the larger context first and then making design choices for what we are building.
Even if we are building a small part of a large system, we should keep the larger goal in mind. Doing so is systems thinking.
Let's take the example of engineering a Data Science System.
Question
The first answer that comes to mind is Data. But what else? Data Science System = Data + ___ + ___ + ___ + ...
One way to fill the gaps is to understand the process involved in engineering a Data Science System.
Data Science Process
Thinking about it, an easy guess would be domain knowledge along with maths, stats and engineering skills:
- Domain Knowledge
- Math and Stats
- Hacking Skills (Engineering Skills)
Another way to show this is:
- Business
- Programming
- Statistics
- Communication
Based on these skills and how they are used, different roles are defined:
Data Scientist: Generally someone who understands all of the above aspects (Business + Programming + Statistics + Communication).
Data Analyst: Someone who understands the business and communicates well, with some knowledge of programming and statistics.
Research Engineer: Strong Programming + Statistics + Communication, though possibly without the business understanding.
Data Engineer: Strong Business + Programming; integrates different IT infrastructure (data sources and systems) to perform the task.
Now let's shift gears and look at one of the popular processes used for engineering a Data Science System.
CRISP-DM
Engineering systems for Data Science involves two components:
Process + Programming
Question
What, then, is a process when building a system?
A process can be thought of as a flow of steps plus agile improvements (start with small components and improve them gradually).
One example of such a process for engineering data science systems is CRISP-DM.
graph TD
Business_Understanding --> Data_Understanding
Data_Understanding --> Business_Understanding
Data_Understanding --> Data_Preparation
Data_Preparation --> Modelling
Modelling --> Data_Preparation
Modelling --> Evaluation
Evaluation --> Deployment
Evaluation --> Business_Understanding
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
The entire process is iterative and continuous.
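To make the loops in the diagram concrete, the phase transitions can be written down as a small adjacency map. This is only an illustrative sketch: the edges are copied from the graph above, while the names `CRISP_DM`, `next_phases` and `can_reach` are hypothetical helpers and not part of any CRISP-DM tooling.

```python
# Phase transitions copied from the CRISP-DM diagram above.
CRISP_DM = {
    "Business_Understanding": ["Data_Understanding"],
    "Data_Understanding": ["Business_Understanding", "Data_Preparation"],
    "Data_Preparation": ["Modelling"],
    "Modelling": ["Data_Preparation", "Evaluation"],
    "Evaluation": ["Deployment", "Business_Understanding"],
    "Deployment": [],
}


def next_phases(phase):
    """Return the phases that may follow `phase` in the diagram."""
    return CRISP_DM[phase]


def can_reach(start, goal):
    """Depth-first search: is `goal` reachable from `start`?"""
    seen, stack = set(), [start]
    while stack:
        phase = stack.pop()
        if phase == goal:
            return True
        if phase not in seen:
            seen.add(phase)
            stack.extend(CRISP_DM[phase])
    return False


print(next_phases("Evaluation"))
print(can_reach("Modelling", "Business_Understanding"))
```

The back edges (for example Evaluation to Business_Understanding) are what make the process iterative: work in a later phase can send the project back to an earlier one.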
Let's zoom into each component of the process and see what questions we can ask to better understand the requirements.
CRISP-DM: Business Understanding
Business Understanding
- What are the business objectives?
- Can data science achieve those objectives? (Often it simply reveals the shortcomings of existing systems.)
- How do we define success metrics?
- Are there ethical considerations in data usage?
- What have other industries achieved? (What is the state of the art (SOTA)?)
CRISP-DM: Data Understanding, Preparation & Modelling
Data Understanding:
- What are the sources of data?
- Does new data need to be collected? (This is slow and expensive.)
- What is the quality and quantity of the data?
- What do the different data items represent?
- Which data is relevant to the objective?
Agile steps for Data Preparation:
- What are the different data formats?
- Is there a need to annotate data?
- How can data be extracted, transformed and loaded?
- How to standardise (rescale data to zero mean and unit variance) and normalise (rescale data to lie between 0 and 1, with the minimum mapped to 0 and the maximum mapped to 1)?
- How to store data efficiently for analysis?
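The difference between standardising and normalising is easiest to see on a handful of numbers. The sketch below uses only the Python standard library; the function names and the sample `ages` values are made up for illustration, and a real project would more likely use a library scaler.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardise constant data")
    return [(v - mu) / sigma for v in values]


def normalise(values):
    """Min-max scaling: the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise constant data")
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 22, 30, 50]  # made-up sample values
z = standardise(ages)
scaled = normalise(ages)
print(z)
print(scaled)
```

Standardised values can be negative or larger than 1, whereas normalised values always stay within the range 0 to 1.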
Data Modelling
- What assumptions should be made for the models?
- Statistical or algorithmic modelling? (Statistical modelling uses a small set of simple models and provides strong, concrete statistical evidence for the findings. Algorithmic modelling tries many complex models and lets the machine search for the best one.)
- Is the clean data sufficient for modelling?
- Is the compute budget sufficient for modelling?
- Are the results statistically significant?
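One possible way to approach the significance question is a permutation test, shown here as a minimal sketch rather than a recommended procedure. It asks how often a random relabelling of the scores produces a gap at least as large as the one observed. The function name `permutation_p_value` and the accuracy scores are assumptions made up for this example.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Made-up accuracy scores of two models over five runs each.
model_a = [0.90, 0.91, 0.92, 0.93, 0.94]
model_b = [0.80, 0.81, 0.82, 0.83, 0.84]
p = permutation_p_value(model_a, model_b, n_permutations=2000)
print(f"p-value: {p:.4f}")
```

A small p-value suggests the gap between the two sets of scores is unlikely to be due to chance alone; the threshold to use (0.05 is a common convention) is a choice the project has to make.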
CRISP-DM: Evaluation & Deployment
Evaluation:
Once the data scientist has finished modelling and shown that the results are statistically significant, the model is passed on to evaluation.
- Does the model work correctly with test data?
- Does the model achieve the business objective?
- Does the model meet performance requirements?
- Is the model unbiased and robust?
- What are the ways to improve the model?
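The first two evaluation questions can be tied together by scoring the model on held-out test data and comparing the score with the success metric agreed during Business Understanding. The sketch below assumes a classification task with accuracy as the metric; the labels, predictions and threshold are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of test examples predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction counts differ")
    if not y_true:
        raise ValueError("test set is empty")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_objective(y_true, y_pred, threshold):
    """Check the test score against the agreed success metric."""
    return accuracy(y_true, y_pred) >= threshold


# Hypothetical held-out labels and model predictions.
y_test = [1, 0, 1, 1]
y_hat = [1, 0, 0, 1]
print(accuracy(y_test, y_hat))
print(meets_objective(y_test, y_hat, threshold=0.7))
```

Accuracy alone does not answer the bias and robustness questions; those need additional checks, such as scoring separately on different subgroups of the test data.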
Deployment:
- Where are the models to be deployed? (Mobile devices, servers, drones, etc.)
- What is the hardware and software stack for deployment?
- Does it meet performance requirements? (Battery limitations, latency, etc.)
- Does it violate privacy requirements?
- Does it meet the user's expectations?
This whole process needs to be carried out iteratively, along with:
- Iterative design and deployment -> MVP
- Revising expectations of success and of the value of data science
- Upgrading human and hardware resources
Compared with an ad hoc approach, following this process makes it much easier to know where the project stands and what the next steps are.