Why: We want to get business insights from Big Data to make more informed decisions.
“KDD is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.” (Fayyad et al., 1996)
You see, Data Mining may be the core process for getting insight out of the data, but it is far from the only thing we have to do to get these valuable insights.
In a nutshell, here are the other tasks you must do before and after Data Mining!
Not all data are helpful. We are responsible for making sense of them! That is why domain knowledge/expertise is crucial to Data Mining and any type of Machine Learning.
One more thing to say about this illustration: KDD is an iterative process. Your insight might not be discovered all at once. Below is my attempt to explain each of these 5 steps.
1. Data Selection
Output: Target Data
- Select relevant features
The first step is always crucial. Which data to select depends on the question you pose. The quality of this first question depends mostly on your understanding of the topic and of the data structure itself. This is why a data analyst must work hand in hand with the field experts from the beginning.
- What data do I have at all?
- Which data do I need to answer my question?
- Which additional data can I use?
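To make these questions concrete, here is a minimal sketch with pandas. The table and all column names (`device_id`, `temperature_c`, and so on) are made up for illustration; your own data will look different.

```python
import pandas as pd

# Hypothetical raw sensor log; every column name here is an assumption.
raw = pd.DataFrame({
    "device_id": ["a1", "a1", "b2", "b2"],
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 00:00", "2023-01-01 01:00",
    ]),
    "temperature_c": [21.5, 22.0, 19.8, 20.1],
    "firmware_note": ["ok", "ok", "ok", "ok"],
    "operator_name": ["x", "y", "x", "y"],
})

# What data do I have at all? Inspect the columns and their types.
print(raw.dtypes)

# Which data do I need to answer my question?
# Here we assume the question is about temperature per device over time.
relevant_columns = ["device_id", "timestamp", "temperature_c"]
target = raw[relevant_columns].copy()

print(target.head())
```

The result, `target`, plays the role of the Target Data that the later steps work on.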
2. Data Preprocessing
Output: Preprocessed and Clean Data
- Drop missing or irrelevant features
- Fill missing but relevant values, but how? Average?
- Screen out the noise
- Remove outliers, if needed.
- Match terms from different sources
This is no doubt the most laborious and time-consuming of all the steps. You will find your IoT data far from ideal. But do not slack off on this, remember:
Garbage In = Garbage Out
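Here is one possible way to go through the cleaning tasks above with pandas. Treat it as a sketch under assumptions: the data is invented, the 50% threshold for dropping a feature is arbitrary, the outlier rule is the common 1.5 × IQR rule, and the median is used instead of the average because it is less sensitive to extreme values.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a gap, an implausible spike and a mostly empty column.
readings = pd.DataFrame({
    "device_id": ["a1", "a1", "a1", "a1", "a1", "a1"],
    "temperature_c": [21.0, 21.5, np.nan, 22.0, 250.0, 21.8],
    "comment": [None, None, None, None, None, "checked"],
})

# Drop features that are mostly missing (the 0.5 threshold is an assumption).
missing_share = readings.isna().mean()
clean = readings.drop(columns=missing_share[missing_share > 0.5].index)

# Remove outliers with the 1.5 * IQR rule, keeping rows that are still missing.
temp = clean["temperature_c"]
q1, q3 = temp.quantile(0.25), temp.quantile(0.75)
iqr = q3 - q1
in_range = temp.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range | temp.isna()].copy()

# Fill the remaining missing but relevant values with the median.
median_temp = clean["temperature_c"].median()
clean["temperature_c"] = clean["temperature_c"].fillna(median_temp)

print(clean)
```

Removing the outlier before filling the gap matters here: otherwise the 250.0 spike would distort the value used for filling.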
3. Data Transformation
Output: Mapped/Labeled Data
- Combine data sets
- Transform data types (e.g., label data)
- Map related features
What can be consolidated, and what must be transformed? This step comes before the mining itself, but you must already think about the end result to maximize the chance that you will find a pattern with classification or characterization.
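As a small illustration of combining data sets and transforming data types, here is a sketch with pandas. Both tables, the `status` values and the label mapping are hypothetical.

```python
import pandas as pd

# Two hypothetical sources that describe the same devices.
sensors = pd.DataFrame({
    "device_id": ["a1", "b2", "c3"],
    "mean_temp_c": [21.6, 20.0, 35.2],
})
inventory = pd.DataFrame({
    "device_id": ["a1", "b2", "c3"],
    "status": ["OK", "ok", "Fault"],
})

# Combine data sets on their shared key.
combined = sensors.merge(inventory, on="device_id", how="inner")

# Transform data types: normalise the spelling, then map text to a numeric label.
status_labels = {"ok": 0, "fault": 1}
combined["status_label"] = combined["status"].str.lower().map(status_labels)

print(combined)
```

Any status that is not in `status_labels` would become NaN after `map`, which is a handy way to spot terms you have not mapped yet.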
Automated approaches to reduce and combine features
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
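To give a feeling for what such an automated approach does, here is a minimal PCA sketch with scikit-learn. The data is synthetic and built so that two of the three features carry almost the same information; the choice of two components is an assumption for this toy case. LDA works in a similar spirit but additionally needs class labels, so it is not shown here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: the second feature is nearly a scaled copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    base * 2 + rng.normal(scale=0.05, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# PCA is sensitive to scale, so standardise the features first.
X_scaled = StandardScaler().fit_transform(X)

# Reduce three features to two combined ones.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Share of the total variance kept by each component.
print(pca.explained_variance_ratio_)
```

If the printed ratios add up to nearly 1, very little information was lost by dropping one dimension.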
You see, raw data is sometimes too vague. So instead, think about a rate of change or a derived feature that can be characterized, then transform your data into this form before blindly running an algorithm to find a pattern.
This takes trial and error. Therefore, KDD is an iterative process ; )
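A rate of change is easy to derive with pandas. The sketch below uses an invented temperature log with uneven time gaps; the column name `temp_change_per_hour` is just a name chosen for this example.

```python
import pandas as pd

# Hypothetical temperature log; note the two-hour gap before the last reading.
log = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-01 02:00", "2023-01-01 04:00",
    ]),
    "temperature_c": [20.0, 21.0, 23.0, 23.0],
})

# Elapsed time between consecutive readings, in hours.
hours = log["timestamp"].diff().dt.total_seconds() / 3600

# Derived feature: temperature change per hour.
log["temp_change_per_hour"] = log["temperature_c"].diff() / hours

print(log)
```

The first row has no previous reading, so its rate of change is NaN and needs the same care as any other missing value.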
4. Data Mining
- Outlier Detection: K-Nearest Neighbor, Support Vector Machines, Neural Networks, Bayesian Networks
- Association Rule Mining: Apriori Algorithm
- Classification: Naïve Bayes, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Neural Networks
- Clustering: K-Means (Partition-Based), BIRCH (Hierarchical), DBSCAN (Density-Based)
- Regression: Linear Regression, Lasso Regression, Logistic Regression (despite its name, mostly used for classification), Support Vector Regression, Multivariate Regression, Neural Networks
If you do it right, you will find a pattern that you can capture in a model. Once that model is trained on your data, you can ultimately use it to predict values for new data.
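As one example from the list above, here is a small K-Means sketch with scikit-learn. The two groups of readings are synthetic and deliberately well separated, and the number of clusters is assumed to be known; real data is rarely this tidy.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic readings: [temperature, rate of change] for two operating regimes.
rng = np.random.default_rng(42)
cool = rng.normal(loc=[20.0, 0.5], scale=0.3, size=(50, 2))
hot = rng.normal(loc=[35.0, 3.0], scale=0.3, size=(50, 2))
X = np.vstack([cool, hot])

# Find the pattern: fit two clusters to the data.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Use the model: assign a new, unseen reading to one of the clusters.
new_reading = np.array([[34.5, 2.8]])
predicted_cluster = model.predict(new_reading)[0]

print(model.cluster_centers_)
print(predicted_cluster)
```

The cluster numbers themselves are arbitrary; what matters is which readings end up together.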
5. Interpretation/Evaluation
Output: Business Insights
- Write a report about new insights
How does this insight help us make a better decision? The presentation is crucial to convince your audience about the way forward. Learn a bit about How to Use Charts and Diagrams to Present Your Data to Customers
Data Mining is a core part of how we can make sense of IoT data, but it cannot be done alone; it is accompanied by other crucial processes. Today, we looked at the steps of KDD, or Knowledge Discovery in Databases. Next, you will have to get your hands dirty and start doing a project.
Kaggle is a great place to start!
You will find all types of interesting data sets and code from other data analysts. Have fun!