Foundations of Data Science
Learning Objectives:
In this explainable article, we discuss Data Science, the tasks associated with it, and how it differs from Machine Learning, Deep Learning, etc., which is an important distinction to understand.
What is Data Science?
Data Science is the science of collecting, storing, processing, describing and modelling data.
Why are there multiple confusing definitions?
Since it is an assortment of several tasks, there is often no clear boundary as to whether we must perform all of them, or only some, before we can call ourselves Data Scientists.
Most of the time, the emphasis on each task depends on the application and the organisation. For example, if an organisation already has abundant data, there will be less focus on data collection and more on processing it.
Thus it all depends on the organisation and the task at hand.
What are the Tasks Associated with Data Science?
Collect
What is involved in data collection?
Data collection mainly depends on:
i. The question a data scientist is trying to answer.
ii. The environment in which the data scientist is working.
For example, if a data scientist is working at an internet company, the data would most probably already exist within the organisation in a more or less structured way.
Now take the example of a movie production house that wants to know what people are saying about a new movie. In this case we mostly need to crawl and scrape data from social media, review sites, etc., i.e. the data is available, but not with the organisation; it needs to be gathered from external sources.
Let's take one more case, in which a data scientist is working with doctors who need to know how effective a new drug is. Here, initially we have no data available at all, and we need to design experiments to generate it. This is one of the places where knowledge of Statistics is needed.
Store
Transactional & Operational Data (Structured Data):
Traditionally, most data was transactional and operational, i.e. records of transactions taking place, or records of things that happen daily to run the normal operations of a business, e.g. invoices and insurance claims.
This is generally structured data, for which a relational database is used; such databases are optimised for SQL queries.
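As a rough illustration (using Python's built-in sqlite3, with a made-up invoices table), a relational database serves record-level, transactional queries like the following:

```python
import sqlite3

# A hypothetical, minimal invoices table (illustrative column names, not a real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, day TEXT)")
conn.executemany(
    "INSERT INTO invoices (customer, amount, day) VALUES (?, ?, ?)",
    [("Asha", 120.0, "2024-01-05"), ("Ravi", 80.5, "2024-01-05"), ("Asha", 42.0, "2024-01-06")],
)

# A typical transactional (OLTP-style) query: look up one customer's invoices.
rows = conn.execute(
    "SELECT id, amount, day FROM invoices WHERE customer = ?", ("Asha",)
).fetchall()
print(rows)
```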
Data from multiple databases:
Let's take the example of a big telecommunications company that provides mobile, DTH cable, broadband and other services and needs to integrate data into a common repository, typically to support analytics. For such a scenario a Data Warehouse is used, which not only collects the data from multiple stores but is also optimised for analysis (think aggregation of millions of rows for analytics).
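In contrast to record lookups, warehouse-style workloads aggregate over many rows at once. A minimal sketch of such an analytical query (again with sqlite3 standing in for a real warehouse; the table, columns and numbers are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (service TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?)",
    [("mobile", "north", 10.0), ("mobile", "south", 12.5),
     ("broadband", "north", 30.0), ("dth", "south", 7.0)],
)

# Analytical (OLAP-style) query: aggregate revenue per service across all rows.
for service, total in conn.execute(
    "SELECT service, SUM(revenue) AS total FROM usage GROUP BY service ORDER BY total DESC"
):
    print(service, total)
```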
Unstructured Data:
With the advent of smartphones, IoT devices, etc., a large volume of different types of data is being generated, i.e. data with high volume, high variety and high velocity. This is termed the era of Big Data.
For this type of data we use a Data Lake (e.g. Hadoop), which is used to store Big Data (large collections of structured/unstructured data, from within or outside the organisation).
This uncurated data (structured/unstructured, from a variety of places) is stored without being immediately useful, with the thinking that it might be used in the future.
If data processing is to be performed on Big Data with millions of data items, performance becomes a factor. In such cases Distributed Processing becomes the choice: divide the data into chunks, distribute the processing across a cluster of computers/servers, and then aggregate the results back, which is what Hadoop (MapReduce) helps do.
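A minimal sketch of this divide, process and aggregate idea (simulated here with Python's multiprocessing on one machine, not a real Hadoop cluster): compute a "map" result per chunk, then "reduce" the partial results into one answer.

```python
from multiprocessing import Pool

def map_chunk(chunk):
    # "Map" step: compute a partial result (sum and count) for one chunk of data.
    return sum(chunk), len(chunk)

def reduce_results(partials):
    # "Reduce" step: aggregate the partial results back into a single answer (the mean).
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    data = list(range(1_000_000))                                   # stand-in for a large dataset
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool(processes=4) as pool:
        partials = pool.map(map_chunk, chunks)                      # distribute chunks to workers
    print(reduce_results(partials))                                 # aggregate the results back
```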
Summary
| Storage | Description |
|---|---|
| Relational Databases | Structured data; optimised for SQL queries |
| Data Warehouses | Structured, curated data; optimised for analysis |
| Data Lake | Structured and/or unstructured data, not curated (dumped with the hope of being useful in the future) |
Process
After the data is collected and stored, it needs to be processed to become useful for analysis or modelling.
Data Wrangling or Data Munging:
Let's take the example of an HR department that stores data about all employees and sends it to the publishing department, which takes and stores only certain relevant details about people who are authors.
This selecting and converting of data into a useful form is called Data Wrangling (or Munging).
We typically perform extract, transform and load (ETL) on such data.
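A small, hypothetical sketch of this wrangling with pandas (the column names and the `is_author` flag are made up for illustration): extract the HR records, transform/select only what the publishing department needs, and load the result.

```python
import pandas as pd

# Extract: a hypothetical HR dataset (in practice this would come from a database or file).
hr = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meera"],
    "department": ["Research", "Sales", "Research"],
    "is_author": [True, False, True],
    "salary": [90_000, 70_000, 85_000],
})

# Transform: keep only authors and only the columns relevant to the publishing department.
authors = hr.loc[hr["is_author"], ["employee_id", "name", "department"]]

# Load: write the converted data to wherever the publishing department stores it.
authors.to_csv("authors.csv", index=False)
```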
Data Cleaning:
The process of data cleaning is an important aspect of Data Science; the steps below describe the general process, and a small sketch in code follows the list.
- Fill missing values (e.g. substitute a missing value with the mean of the data).
- Standardise keyword tags (e.g. remove duplicate author details).
- Correct spelling errors.
- Identify and remove outliers, among other steps.
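A minimal pandas sketch of a few of these steps; the records and the "ages above 120 are implausible" outlier rule are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical messy records (all values are made up).
df = pd.DataFrame({
    "age": [25, 31, None, 29, 999],
    "city": ["pune", "Pune ", "PUNE", "delhi", "Delhi"],
})

# Fill missing values with the mean of the column.
df["age"] = df["age"].fillna(df["age"].mean())

# Standardise text values (trim whitespace, lower-case) so duplicates collapse.
df["city"] = df["city"].str.strip().str.lower()

# Identify and remove outliers with a simple domain-knowledge rule.
df = df[df["age"] <= 120]
print(df)
```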
Data Scaling, Normalising and Standardising:
Generally, the collected data can be on different scales across data points and features. To resolve this, we use the methods below (sketched in code after the list).
- Scale: e.g. kilometres to miles, dollars to rupees.
- Normalise: zero mean, unit variance.
- Standardise: all values between 0 and 1.
We will discuss these in detail in subsequent chapters.
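For now, a small numpy sketch of the three operations, using this article's terminology (note that many libraries swap the names "normalise" and "standardise"); the numbers are made up.

```python
import numpy as np

x = np.array([3.0, 8.0, 12.0, 20.0, 57.0])            # e.g. distances in kilometres

scaled = x * 0.621371                                  # scale: kilometres -> miles
normalised = (x - x.mean()) / x.std()                  # normalise: zero mean, unit variance
standardised = (x - x.min()) / (x.max() - x.min())     # standardise: all values between 0 and 1

print(scaled, normalised, standardised, sep="\n")
```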
Describe
Describing data generally involves two steps: visualising and summarising the data.
Visualising Data:
A lot of the time we want a quick analysis of our records, to see if we can find any interesting facts that could help with a business decision. This is where we plot graphs etc. to visualise the stored data and come up with findings.
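For example, a quick matplotlib sketch of daily movie rentals (the numbers are made up) to eyeball a trend:

```python
import matplotlib.pyplot as plt

days = list(range(1, 11))
rentals = [34, 30, 41, 38, 45, 52, 48, 40, 55, 60]   # made-up daily rental counts

plt.plot(days, rentals, marker="o")
plt.xlabel("Day")
plt.ylabel("Rentals")
plt.title("Daily movie rentals (illustrative data)")
plt.show()
```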
Summarising Data:
After visualising the data, we generally want answers to specific queries, e.g. what is the typical daily sale of movie rentals? We could use the mean, median, mode or another statistical parameter for this. Or, what is the typical variation in the number of TVs sold? Here the standard deviation, variance, etc. are helpful. Visualising and summarising data falls under Descriptive Statistics; the other branch is Inferential Statistics.
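A quick sketch of such summaries with pandas (the same made-up rental numbers as above):

```python
import pandas as pd

rentals = pd.Series([34, 30, 41, 38, 45, 52, 48, 40, 55, 60])

print("mean:  ", rentals.mean())          # a typical value
print("median:", rentals.median())        # a robust typical value
print("mode:  ", rentals.mode().tolist())
print("std:   ", rentals.std())           # typical variation
print("var:   ", rentals.var())
```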
Model
Statistical Modelling:
Coming up with models of the data and then making inferences on top of those models is an important part of Data Science. Many physical/biological quantities follow a certain distribution, say the normal distribution, e.g. blood sugar level, cholesterol level, etc.
So if we take a large enough sample and it tends to follow a normal distribution, we can make many inferences about it, because these distributions have been studied extensively and have properties that can be used to infer things from the data.
To draw robust conclusions from the data, we need to understand the underlying model and the relations in the data. In general, statistical modelling is used for the objectives below (a small sketch in code follows the list).
- Model the underlying data distribution.
- Model the underlying relations in the data.
- Formulate and test hypotheses.
- Give statistical guarantees.
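A minimal scipy sketch of these objectives: fit a normal distribution to (simulated) blood-sugar readings and test a hypothesis about the mean with a one-sample t-test. The readings and the reference value of 100 are made-up assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
readings = rng.normal(loc=105, scale=15, size=200)    # simulated blood-sugar readings

# Model the underlying distribution: estimate the parameters of a normal.
mu, sigma = stats.norm.fit(readings)

# Formulate and test a hypothesis: is the population mean different from 100?
t_stat, p_value = stats.ttest_1samp(readings, popmean=100)

print(f"fitted normal: mu={mu:.1f}, sigma={sigma:.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")         # the p-value is the statistical guarantee
```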
In statistical modelling the models considered are simple, largely because in most such cases the data is not high-dimensional.
Algorithmic Modelling:
In statistical modelling we assumed simple models, which allow robust statistical analysis and give statistical guarantees (such as p-values and goodness-of-fit tests).
But in many real-world cases the relationship is much more complex than a linear model can capture.
In such cases we use Algorithmic Modelling. This is where Machine Learning and Deep Learning come into the picture: they provide a family of complex functions for modelling the data, i.e. for estimating a function f from data using optimisation techniques.
Compared to statistical modelling, more importance is given to predicting correctly and less to explaining the underlying data.
In applications where we don't care much about the underlying relationships in the data, i.e. why certain things are happening, we tend to go for algorithmic modelling, and vice versa.
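As a rough sketch of the contrast, here a flexible model (a decision tree from scikit-learn) captures a non-linear relationship that a single straight line fits poorly; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)   # non-linear relationship plus noise

linear = LinearRegression().fit(X, y)                    # simple, interpretable model
tree = DecisionTreeRegressor(max_depth=4).fit(X, y)      # complex, flexible model

# The flexible model tracks the curve; the straight line cannot.
print("linear R^2:", round(linear.score(X, y), 2))
print("tree   R^2:", round(tree.score(X, y), 2))
```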
| Statistical Modelling | Algorithmic Modelling |
|---|---|
| Simple, intuitive models | Complex, flexible models |
| More suited to low-dimensional data | Can work with high-dimensional data |
| Robust statistical analysis is possible | Not suited to robust statistical analysis |
| Focus on interpretability | Focus on prediction |
| Data-lean models; more statistics | Data-hungry models; more ML and DL |
| e.g. Linear Regression, Logistic Regression, Linear Discriminant Analysis | e.g. Linear Regression, Logistic Regression, Linear Discriminant Analysis, Decision Trees, k-NN, SVMs, Naive Bayes, Neural Networks |
When we have large amounts of high-dimensional data and want to learn a very complex relationship between the output and the input, we use a special class of complex ML models and algorithms collectively referred to as Deep Learning.
Thus Machine Learning and Deep Learning are just a part of the algorithmic modelling side of Data Science.
Difference Between Artificial Intelligence (AI), Machine Learning, Deep Learning and Data Science?
In general AI is defined as building systems or agents that demonstrate "intelligence".
Still, there is confusion about how it relates to Data Science, ML, DL, etc., and how different it is from them.
Let's look at the tasks that constitute AI and see which of them relate to Data Science.
Problem Solving:
Let's take an example: our agent has to traverse a maze. In this case our first priority is to find efficient search algorithms (e.g. BFS, DFS, A*), which are not data-driven (no data, no modelling).
So we can say that problem solving is a part of artificial intelligence that encompasses a number of techniques, such as trees, B-trees and heuristic algorithms, to solve a problem.
We can also say that a problem-solving agent is a result-driven agent that always focuses on satisfying its goals.
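A minimal sketch of such a search (breadth-first search on a small grid maze; the layout is made up, 1 = wall). Note that no data or modelling is involved.

```python
from collections import deque

maze = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
start, goal = (0, 0), (3, 3)

def bfs(maze, start, goal):
    rows, cols = len(maze), len(maze[0])
    queue = deque([[start]])               # each queue entry is the path taken so far
    seen = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path                    # the first path BFS finds is a shortest one
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None                            # no path exists

print(bfs(maze, start, goal))
```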
Knowledge Representation:
Now let's say the maze has certain complex rules: if there is a green ball in a cell, then the cell to its right gives more power; similarly, if there is a red ball in a cell, then the cell to its right has a pit in it.
In this case too we don't need data, but we do need propositional and first-order logic to represent this knowledge of the rules.
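One deliberately simplistic way to sketch such rules in code, as facts plus implication-style rules (real systems would use proper propositional/first-order logic representations; the maze state here is made up):

```python
# Facts: which cells contain a green or red ball (made-up maze state).
facts = {
    ("green_ball", (2, 1)): True,
    ("red_ball", (0, 2)): True,
}

def right_of(cell):
    row, col = cell
    return (row, col + 1)

# Implication-style rules:
#   green ball in a cell  =>  extra power in the cell to its right
#   red ball in a cell    =>  a pit in the cell to its right
def infer(facts):
    derived = {}
    for (predicate, cell), holds in facts.items():
        if holds and predicate == "green_ball":
            derived[("extra_power", right_of(cell))] = True
        if holds and predicate == "red_ball":
            derived[("pit", right_of(cell))] = True
    return derived

print(infer(facts))   # {('extra_power', (2, 2)): True, ('pit', (0, 3)): True}
```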
Reasoning:
Once the knowledge representation is complete, the next step is to apply reasoning over it to navigate the maze and make a decision; this is termed Decision Making.
Expert Systems:
A lot of real-world applications use rule-based systems specifically designed by experts; these are called Expert Systems.
The rules are encoded using knowledge representation, and the program performs reasoning and executes these rules to complete the task.
But in many other cases expert systems can't be used, as the rules may be too complex, inexpressible, or simply unknown.
The alternative approach is to learn from large amounts of data, i.e. to use algorithmic modelling (ML, DL) to predict and make decisions. This data-driven part of AI, i.e. training agents to make decisions, is what intersects with Data Science.
Perception, Communication and Actuation:
The next part of AI is communication, i.e. an agent should be able to communicate with humans. For this, Natural Language Understanding and Natural Language Generation are important; together they are called Natural Language Processing (NLP), and modern NLP is almost completely data-driven (it mainly uses ML and DL).
Similarly, perception is the ability to see and listen (computer vision, speech recognition), which is also largely done using DL.
Finally, actuation mainly relates to robotics. All of these, i.e. NLP, vision, speech recognition, actuation, etc., are data-driven and mostly use ML and DL.
Thus, since decision making as well as perception, communication and actuation are now mostly data-driven, these are the areas where AI intersects with Data Science and ML/DL.
Share this explainable article with friends and colleagues, and let us know in the comments how it helped you on your way to an entry-level data science job.
In the next explainable article, we will discuss the process involved in a Data Science project.