Data Mining


Data mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Process: The Knowledge Discovery in Databases (KDD) process is commonly defined as having five stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.

Many variations on this theme exist, however, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment

or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.

Pre-Processing: Pre-processing is the first step in the data mining process. Here we examine the data and assemble the target data set.
Because data mining can only uncover the patterns and features of interest that are actually present in the data, the target data set must be large enough to contain them. Pre-processing is also essential for analyzing multivariate data sets before data mining. The target set is then cleaned: data cleaning removes observations that contain noise and those with missing data.
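The cleaning step described above can be sketched in a few lines of Python. This is only a minimal illustration, not a standard recipe: the record fields (age, income), the function name clean and the rule used to decide what counts as "noise" (a value more than five median absolute deviations away from the median) are all assumptions made for the example.

```python
from statistics import median

def clean(records, noise_field, max_mad_multiple=5.0):
    """Drop records with missing values, then drop records whose
    noise_field lies far from the median (an assumed noise rule)."""
    # Step 1: remove observations with missing data (None values).
    complete = [r for r in records if all(v is not None for v in r.values())]
    if not complete:
        return []

    # Step 2: remove noisy observations using the median absolute deviation.
    values = [r[noise_field] for r in complete]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return complete  # no spread to measure noise against
    limit = max_mad_multiple * mad
    return [r for r in complete if abs(r[noise_field] - med) <= limit]

# Hypothetical target data set.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing value
    {"age": 29, "income": 51000},
    {"age": 41, "income": 9900000},   # likely a data-entry error
    {"age": 38, "income": 55000},
    {"age": 45, "income": 49000},
]

cleaned = clean(records, "income")
print(len(cleaned), "of", len(records), "records kept")
```

In practice, dropping records is only one option. Depending on the data set, missing values may instead be filled in, and the definition of noise has to come from knowledge of the domain.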

Data Mining: Data mining involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization. Anomaly detection is the identification of unusual data records, which may be interesting in their own right or may be data errors that require further investigation. Association rule learning searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering is the task of discovering groups and structures in the data that are in some way or another “similar”, without using known structures in the data. Classification is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an e-mail as “legitimate” or as “spam”. Regression attempts to find a function that models the data with the least error. Finally, summarization provides a more compact representation of the data set, including visualization and report generation.
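To make the supermarket example concrete, here is a small sketch of market basket analysis in Python. It counts how often pairs of products appear in the same basket and reports each pair's support (the share of baskets containing both items) and confidence (the share of baskets containing the first item that also contain the second). The baskets, the function name frequent_pairs and the 0.5 support threshold are invented for illustration. Real association rule miners such as Apriori also handle larger item sets.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=0.5):
    """Return rules (a, b, support, confidence) meaning 'a implies b'
    for every item pair bought together often enough."""
    n = len(baskets)
    if n == 0:
        return []
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first, then alphabetical for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

# Hypothetical purchase data: one list per customer basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]

rules = frequent_pairs(baskets)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, every basket that contains milk also contains bread, so the rule "milk -> bread" comes out on top with a confidence of 1.00.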

Result Verification: The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the algorithms to find patterns in the training set that are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. If the learned patterns do not meet the desired standards, it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, the final step is to interpret the learned patterns and turn them into knowledge.
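The idea of checking learned patterns against a held-out test set can be sketched as follows. Everything in this example is hypothetical: each e-mail is reduced to a single number (a count of suspicious words), and two toy models are compared. The "memorizer" simply stores the training examples, which is a deliberate caricature of overfitting, while the threshold model learns one general cut-off.

```python
from collections import Counter

def fit_memorizer(train):
    """Overfitting caricature: remember every training example and
    answer with the majority label for anything unseen."""
    table = {x: y for x, y in train}
    default = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: table.get(x, default)

def fit_threshold(train):
    """Learn one cut-off halfway between the two classes."""
    legit_max = max(x for x, y in train if y == "legit")
    spam_min = min(x for x, y in train if y == "spam")
    cut = (legit_max + spam_min) / 2
    return lambda x: "spam" if x > cut else "legit"

def accuracy(model, data):
    """Share of examples where the model output matches the desired output."""
    return sum(model(x) == y for x, y in data) / len(data)

# (suspicious word count, label): invented data, split by hand.
train = [(0, "legit"), (1, "legit"), (2, "legit"), (5, "spam"), (7, "spam")]
test = [(4, "spam"), (6, "spam"), (1, "legit"), (8, "spam")]

memorizer = fit_memorizer(train)
threshold = fit_threshold(train)

train_acc = accuracy(memorizer, train)
test_acc = accuracy(memorizer, test)
threshold_test_acc = accuracy(threshold, test)

print("memorizer: train", train_acc, "test", test_acc)
print("threshold: test", threshold_test_acc)
```

The memorizer looks perfect on the training set but does poorly on the test set, which is exactly the gap that signals overfitting. A model that fails the desired standard on the test set sends us back to the pre-processing and data mining steps.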
