Why: We want to get business insights from Big Data so that we can make more informed decisions.
“KDD is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.” (Fayyad et al., 1996)
You see, ‘Data Mining’ may be the core step in getting insight out of the data, but it is far from the only thing we have to do to get these valuable insights.
In a nutshell, here are the other tasks you must do before and after Data Mining!
Not all data are equally valuable, but every data set holds potential insights waiting to be unlocked. It is our responsibility as data scientists to meticulously extract meaning and value from these diverse data sets. This is where domain knowledge and expertise become indispensable, playing a vital role in the intricate process of Data Mining and all types of Machine Learning algorithms. With the right combination of technical skills and subject-specific understanding, we can unravel the hidden patterns and make informed decisions that drive innovation and progress.
One thing I would add to this illustration is that KDD is an iterative process. Your insight might not be discovered all at once. Below is my attempt to explain each of these 5 steps.
1. Data Selection
Output: Target Data
- Select relevant features
The first step is always crucial. Which data to select depends on the question you pose. The quality of this first question depends mostly on your understanding of the topic and of the data structure itself. This is why a data analyst must work hand in hand with the field experts from the beginning.
- What data do I have at all?
- Which data do I need to answer my question?
- Which additional data can I use?
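As a rough illustration of selecting relevant features, here is a minimal sketch in plain Python. The records, the field names, and the question being asked are all made up for this example; they are not from a real data set.

```python
# Hypothetical sensor records; every field name here is an assumption.
records = [
    {"device": "pump-1", "timestamp": "2024-01-01T00:00", "temperature": 41.5,
     "firmware": "1.0.3", "operator": "A"},
    {"device": "pump-1", "timestamp": "2024-01-01T01:00", "temperature": 42.1,
     "firmware": "1.0.3", "operator": "B"},
    {"device": "pump-2", "timestamp": "2024-01-01T00:00", "temperature": 39.8,
     "firmware": "1.0.4", "operator": "A"},
]

# Features assumed relevant to the question
# "does the temperature of a device drift over time?"
RELEVANT_FEATURES = ["device", "timestamp", "temperature"]


def select_target_data(rows, features):
    """Keep only the chosen features from every record."""
    return [{name: row[name] for name in features} for row in rows]


target_data = select_target_data(records, RELEVANT_FEATURES)
print(target_data[0])
```

Which features end up in the list is exactly the decision you make together with the field experts; the code only applies it.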
2. Data Preprocessing
Output: Preprocessed and Clean Data
- Drop missing or irrelevant features
- Fill missing but relevant values, but how? Average?
- Screen out the noise
- Remove outliers, if needed.
- Matching terms from different sources
This is no doubt the most effortful and time-consuming of all the steps. You will find your IoT data far from ideal. But do not slack off on this, remember:
Garbage In = Garbage Out
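To make a few of these cleaning tasks concrete, here is a small sketch using only the Python standard library. The readings are invented, and the choices it makes (the 1.5 × IQR rule for outliers, the median for filling gaps) are assumptions for illustration, not the one right answer to "but how?".

```python
from statistics import median, quantiles

# Hypothetical (site, temperature) readings; None marks a missing value.
raw = [
    ("Plant A", 20.5), ("plant a", None), ("Plant-B", 21.0),
    ("plant b", 19.5), ("Plant A", 250.0), ("Plant B", 20.0),
    ("Plant A", 20.0), ("plant a", 20.5), ("Plant B", 21.0),
]


def normalize_site(name):
    """Match differently spelled site names coming from different sources."""
    return name.lower().replace("-", " ").title()


rows = [(normalize_site(site), temp) for site, temp in raw]

# Flag outliers with the common 1.5 * IQR rule (an assumed choice).
known = [temp for _, temp in rows if temp is not None]
q1, _, q3 = quantiles(known, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
rows = [(site, temp) for site, temp in rows
        if temp is None or low <= temp <= high]

# Fill the remaining gaps with the median, which is less sensitive
# to extreme values than the average.
fill_value = median(temp for _, temp in rows if temp is not None)
clean = [(site, fill_value if temp is None else temp) for site, temp in rows]

for row in clean:
    print(row)
```

Whether 250.0 is a sensor glitch or a real event is a question for the domain expert; the rule only flags it.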
3. Data Transformation
Output: Mapped/Labeled Data
- Combine data sets
- Transform data types (e.g., label data)
- Map related features
What can be consolidated, and what must be transformed? This is the step before the mining itself; you must think about the end result now to maximize the chance that you will find a pattern with classification or characterization.
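Combining data sets and turning text categories into labels could look roughly like the sketch below. The two data sets, the shared "device" key, and the label encoding are hypothetical.

```python
# Hypothetical data sets that share a "device" key.
readings = [
    {"device": "pump-1", "vibration": 0.2},
    {"device": "pump-2", "vibration": 0.9},
    {"device": "pump-3", "vibration": 0.4},
]
maintenance = {"pump-1": "ok", "pump-2": "failed", "pump-3": "ok"}

# Assumed encoding of the text status into a numeric label.
STATUS_LABELS = {"ok": 0, "failed": 1}

labeled = []
for row in readings:
    status = maintenance.get(row["device"])
    if status is None:
        continue  # skip readings without a matching maintenance record
    # Combine both sources and add the numeric label.
    labeled.append({**row, "status": status, "label": STATUS_LABELS[status]})

for row in labeled:
    print(row)
```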
Automated approaches to reduce and combine features
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
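For a feel of what PCA does, here is a minimal sketch that assumes NumPy is installed. The data matrix is invented, and `pca` is a helper written for this example, not a library function.

```python
import numpy as np

# Hypothetical data: 6 samples with 3 correlated features.
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.0],
])


def pca(data, n_components):
    """Project data onto its first n_components principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value, i.e. by decreasing explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(data) - 1)
    ratio = explained_variance / explained_variance.sum()
    return centered @ vt[:n_components].T, ratio[:n_components]


reduced, variance_ratio = pca(X, n_components=2)
print(reduced.shape)   # (6, 2)
print(variance_ratio)  # share of the total variance kept per component
```

If the first component already carries most of the variance, the three original features can be replaced by one or two combined ones without losing much information.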
You see, raw data is sometimes too vague. So instead, think about a rate of change or a derived feature that can be characterized, then transform your data into this form before blindly running an algorithm to find a pattern.
This takes trial and error. Therefore, KDD is an iterative process ; )
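A derived feature such as a rate of change takes only a few lines. This sketch uses invented hourly tank levels; `rate_of_change` is a helper made up for the example.

```python
# Hypothetical hourly tank levels in liters.
levels = [100.0, 98.0, 97.0, 90.0, 89.5]


def rate_of_change(values, step=1.0):
    """Change between consecutive values per time step."""
    return [(later - earlier) / step
            for earlier, later in zip(values, values[1:])]


rates = rate_of_change(levels)
print(rates)  # [-2.0, -1.0, -7.0, -0.5]
```

The raw levels look like a steady decline, while the derived rates make the sudden drop of 7 liters in one hour stand out.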
4. Data Mining
- Outlier Detection: K-Nearest Neighbor, Support Vector Machines, Neural Networks, Bayesian Networks
- Association Rule Mining: Apriori Algorithm
- Classification: Naïve Bayes, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Neural Networks
- Clustering: K-Means (Partition-Based), BIRCH (Hierarchical), DBSCAN (Density-Based)
- Regression: Linear Regression, Lasso Regression, Logistic Regression, Support Vector Regression, Multivariate Regression, Neural Networks
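To give a taste of one of these families, the sketch below counts support and confidence for item pairs, the quantities that association rule mining works with. It only covers single items and pairs, so treat it as a simplified first pass in the spirit of Apriori rather than the full algorithm. The baskets and the threshold are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
MIN_SUPPORT = 0.4  # assumed threshold: share of baskets containing the items

n = len(transactions)
item_counts = Counter(item for basket in transactions for item in basket)

# Apriori idea: a pair can only be frequent if both of its items are frequent.
frequent_items = {item for item, count in item_counts.items()
                  if count / n >= MIN_SUPPORT}

pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(basket & frequent_items), 2)
)
frequent_pairs = {pair: count / n for pair, count in pair_counts.items()
                  if count / n >= MIN_SUPPORT}


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(frequent_pairs)
print(confidence("bread", "butter"))  # 0.75
```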
If you do it right, you will find a pattern that can be captured in a model, for example by training a machine learning algorithm on your data. Ultimately, you will be able to use that model to predict values for new data.
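The simplest version of "find a pattern, then predict" is a straight line fitted by least squares. The sketch below uses invented, perfectly linear data so the result is easy to check by hand; `fit_line` and `predict` are names made up for this example.

```python
from statistics import mean

# Hypothetical training data: running hours vs. measured temperature.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
temps = [22.0, 24.0, 26.0, 28.0, 30.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(hours, temps)


def predict(x):
    """Use the fitted model to estimate the value for a new input."""
    return slope * x + intercept


print(slope, intercept)  # 2.0 20.0
print(predict(6.0))      # 32.0
```

Real data will never be this tidy, which is one more reason the earlier cleaning and transformation steps matter.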
5. Interpretation/Evaluation
Output: Business Insights
- Write a report about new insights
How does this insight help us make a better decision? The presentation is crucial to convince your audience of the way forward. Learn a bit about "How to Use Charts and Diagrams to Present Your Data to Customers".
Data Mining plays a crucial role in deciphering the vast amounts of IoT data, acting as a powerful tool for extracting valuable insights. However, it’s important to note that Data Mining is not a standalone process; rather, it works synergistically with other essential procedures. For instance, today we delved into the intricate steps of KDD (Knowledge Discovery in Databases), a comprehensive framework that encompasses various techniques like data preprocessing, data mining, and result interpretation. By integrating these methodologies, we can uncover hidden patterns, trends, and knowledge from complex datasets. Moving forward, it’s time to roll up your sleeves and embark on a hands-on project, where we can put these techniques into practice, further enhancing our understanding and application of Data Mining in real-world scenarios.
Kaggle is a great place to start!
You will find all types of interesting data sets, along with code from other data analysts. Have fun!