Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unknown (or unseen) data against which the model is tested (the testing dataset). The goal of cross-validation is to define a dataset to “test” the model during the training phase (i.e., the validation dataset), in order to limit problems like overfitting and to give an insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real-world problem).
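To make the terminology concrete, here is a minimal Python sketch of such a split, assuming the data are plain NumPy arrays; the variable names and the 60/20/20 proportions are illustrative, not prescribed:

import numpy as np

# Hypothetical data: 100 samples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # features
y = rng.normal(size=100)        # targets

# Shuffle, then carve out training, validation, and test portions.
idx = rng.permutation(len(X))
train_idx, val_idx, test_idx = idx[:60], idx[60:80], idx[80:]

X_train, y_train = X[train_idx], y[train_idx]  # used to fit the model
X_val,   y_val   = X[val_idx],   y[val_idx]    # used to "test" during training
X_test,  y_test  = X[test_idx],  y[test_idx]   # held back for the final assessment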
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
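As a concrete illustration of this procedure, here is a minimal k-fold cross-validation sketch in plain NumPy; the fit and predict callables are placeholders for whatever model is being validated, not a particular library's API:

import numpy as np

def k_fold_cv(X, y, fit, predict, k=5, seed=0):
    # One full run of k-fold cross-validation: each fold serves once as
    # the validation set while the remaining folds form the training set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)          # complementary subsets
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])   # analyse one subset
        pred = predict(model, X[val_idx])         # validate on the other
        errors.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))  # fold RMSE
    return np.mean(errors)                  # average over the rounds

# Example with a trivial "model": always predict the training mean.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
cv_rmse = k_fold_cv(X, y,
                    fit=lambda X, y: y.mean(),
                    predict=lambda m, X: np.full(len(X), m))
print(f"5-fold CV RMSE: {cv_rmse:.3f}")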
One of the main reasons for using cross-validation instead of conventional validation (e.g., partitioning the data set into a fixed 70% training set and 30% test set) is that the error on the training set (e.g., the root mean square error, RMSE) is an optimistically biased estimate of model performance, while the error on a single held-out test set may not represent performance reliably either: there may not be enough data, or the data may not be well enough distributed and spread, to support a single representative training/test split. In these cases, cross-validation is a fair and general way to estimate model prediction performance. In summary, cross-validation combines (averages) measures of fit (prediction error) across partitions to correct for the optimistic nature of the training error and to derive a more accurate estimate of model prediction performance.
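The following sketch illustrates that optimism, assuming a noisy linear relationship and a deliberately over-flexible polynomial model; the exact numbers will vary from run to run, but in a setup like this the training RMSE comes out lower than the cross-validated one:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.3, size=x.size)

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Training error: fit and evaluate on the same points (optimistic).
coef = np.polyfit(x, y, deg=6)
train_rmse = rmse(np.polyval(coef, x), y)

# Cross-validated error: average the validation RMSE over 5 rotating folds.
folds = np.array_split(rng.permutation(x.size), 5)
cv_rmse = np.mean([
    rmse(np.polyval(np.polyfit(np.delete(x, f), np.delete(y, f), deg=6),
                    x[f]), y[f])
    for f in folds
])
print(f"training RMSE: {train_rmse:.3f}  (flatters the model)")
print(f"cross-validated RMSE: {cv_rmse:.3f}  (closer to honest performance)")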