Data
mining (the analysis step of
the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of
discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining
process is to extract information from a data set and transform it into an
understandable structure for further use. Aside from the raw
analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Process: The Knowledge Discovery in
Databases (KDD) process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
Many variations on this theme exist, however, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:
(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment
or a simplified process such as (1) pre-processing,
(2) data mining, and (3) results validation.
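To make the shape of this simplified process concrete, here is a skeletal sketch of it as three composed steps in Python. Every function name and the toy data below are hypothetical placeholders, not a standard API:

# (1) pre-processing, (2) data mining, (3) results validation,
# each reduced to a trivial stand-in for a real technique.

def pre_process(raw_records):
    # Drop missing values as a stand-in for real cleaning.
    return [r for r in raw_records if r is not None]

def mine(records):
    # "Discover" the distinct values as a stand-in for a real mining task.
    return set(records)

def validate(patterns, holdout):
    # Keep only patterns that also appear in data the miner never saw.
    return {p for p in patterns if p in set(holdout)}

raw = [3, 1, None, 3, 2, 1, None, 2]
patterns = mine(pre_process(raw))
print(validate(patterns, holdout=[1, 2, 3, 4]))  # {1, 2, 3}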
Pre-Processing: Pre-processing is the first step in the data mining process. In this step the data is examined and a target data set is assembled. Because data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain those patterns. Pre-processing is essential for analyzing multivariate data sets before data mining. The target set is then cleaned: data cleaning removes the observations containing noise and those with missing data.
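As a concrete sketch of data cleaning, assuming the target data set has been assembled into a pandas DataFrame (pandas is an assumed tool choice, and the readings are fabricated):

import pandas as pd

# Hypothetical target data set: sensor readings with a gap and a noisy value.
df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 3, 3],
    "reading":   [20.1, 19.8, None, 21.0, 400.0, 20.5],
})

df = df.dropna(subset=["reading"])        # drop observations with missing data
df = df[df["reading"].between(-50, 50)]   # drop readings outside a plausible range
print(df)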
Data Mining: The data mining step involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization. Anomaly detection is the identification of unusual data records that might be interesting, or of data errors that require further investigation.
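A minimal sketch of anomaly detection, assuming a simple z-score rule (one of many possible methods; the readings and the threshold of 2 are illustrative choices, not part of any standard):

import statistics

# Fabricated sensor readings; 400.0 is the planted anomaly.
readings = [20.1, 19.8, 21.0, 20.5, 20.2, 19.9, 20.7, 400.0]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag records more than 2 sample standard deviations from the mean.
anomalies = [x for x in readings if abs(x - mean) > 2 * stdev]
print(anomalies)  # [400.0]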
Association rule learning searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
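A minimal market-basket sketch in plain Python, counting how often pairs of products appear in the same basket; the transactions are fabricated for illustration:

from itertools import combinations
from collections import Counter

# Hypothetical supermarket transactions, one set of products per customer.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, f"support={count / len(baskets):.2f}")
# ('bread', 'butter') is bought together in 3 of 4 baskets (support 0.75).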
Clustering is the task of discovering groups and structures in the data that are in some way or another “similar”, without using known structures in the data.
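A short clustering sketch using scikit-learn's k-means (an assumed library and algorithm choice; the points and k=2 are made up for illustration):

from sklearn.cluster import KMeans

# Two visually separated groups of 2-D points (fabricated).
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two discovered groups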
Classification is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an e-mail as “legitimate” or as “spam”.
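A toy spam-classification sketch, assuming scikit-learn's naive Bayes model (the tiny labeled training set is fabricated):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny fabricated training set of labeled e-mails.
emails = ["win money now", "cheap pills win prize",
          "meeting at noon", "project report attached"]
labels = ["spam", "spam", "legitimate", "legitimate"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # bag-of-words features
model = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["win a cheap prize"])
print(model.predict(test))  # ['spam']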
Regression attempts to find a function that models the data with the least error.
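A minimal regression sketch, fitting a least-squares line with numpy; the data points are fabricated to roughly follow y = 2x + 1:

import numpy as np

# Fabricated data roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.9])

# Fit a degree-1 polynomial (a straight line) minimizing squared error.
slope, intercept = np.polyfit(x, y, 1)
print(f"y = {slope:.2f}x + {intercept:.2f}")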
Finally, summarization provides a more compact representation of the data set, including visualization and report generation.
Result
Verification: The final step of
knowledge discovery from data is to verify that the patterns produced by the
data mining algorithms occur in the wider data set. Not all patterns found by
the data mining algorithms are necessarily valid. It is common for the data
mining algorithms to find patterns in the training set which are not present in
the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned
patterns are applied to this test set, and the
resulting output is compared to the desired output. If the learned patterns do not meet the desired
standards, it is necessary to re-evaluate and change the pre-processing
and data mining steps. If the learned patterns do meet the desired standards,
then the final step is to interpret the learned patterns and turn them into
knowledge.
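As a sketch of this verification step, assuming scikit-learn: hold out a test set the algorithm never trains on, apply the learned model to it, and compare its output to the desired output:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; the model is never trained on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Compare the learned patterns' output on the test set to the desired output.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

A model that scores well on its own training data but poorly here is overfitting, which is exactly what this step is designed to catch.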