Optimization: It may happen that the loss function is not convex. We want the loss function to be convex because then the gradient descent method can reliably minimize the loss (error). A non-convex loss function may have a number of local minima and maxima. The target of any optimization technique is to reach the global minimum, but it is possible to get trapped in a local minimum. A local minimum is acceptable only when its error is not much higher than that of the global minimum. There are three different types of optimization techniques.
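Before looking at those techniques, here is a minimal sketch of the local-minimum problem. The one-dimensional function f below is a made-up illustrative objective, not anything from a real model: plain gradient descent ends up in different minima depending on where it starts.

```python
# Illustrative sketch: gradient descent on a toy non-convex function.
# f has a local minimum near x = 1.12 and a global minimum near x = -1.30.

def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)   # move against the gradient
    return x

print(gradient_descent(x0=+2.0))   # converges near  1.12 -> trapped in the local minimum
print(gradient_descent(x0=-2.0))   # converges near -1.30 -> reaches the global minimum
```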
Batch Optimization: In batch optimization, the loss function is computed over the entire training set before every model update, so we must have access to all training samples. First find the training samples that are not correctly classified, then compute the loss of each individual sample, and finally sum the individual losses to obtain the total loss function.
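As a concrete illustration, the sketch below runs batch optimization with a perceptron-style loss on a small synthetic dataset; the dataset, learning rate, and choice of loss are illustrative assumptions. In every epoch the losses of the misclassified samples are summed into the total loss, and the weights are updated only once, using the gradient accumulated over the whole training set.

```python
import numpy as np

# Hypothetical linearly separable data; labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                   # all training samples must be available
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(50):
    margins = y * (X @ w + b)
    misclassified = margins <= 0                          # samples not correctly classified
    total_loss = np.sum(-margins[misclassified])          # sum of individual losses
    # One update per epoch, using the gradient accumulated over all samples
    grad_w = -np.sum(y[misclassified][:, None] * X[misclassified], axis=0)
    grad_b = -np.sum(y[misclassified])
    w -= lr * grad_w
    b -= lr * grad_b

margins = y * (X @ w + b)
print("misclassified after training:", int(np.sum(margins <= 0)))
```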
Upsides:
- Fewer updates to the model mean this variant of gradient descent is more computationally efficient than stochastic gradient descent.
- The decreased update frequency results in a more stable error gradient and may result in more stable convergence on some problems.
- The separation of the calculation of prediction errors from the model update lends the algorithm to parallel-processing-based implementations (see the sketch after this list).
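To illustrate the parallel-processing point, here is a rough sketch reusing the perceptron-style loss from the previous example; the chunking scheme and thread pool are illustrative assumptions. The prediction errors and gradients of independent chunks of the training set are computed separately and only then combined into the single batch update.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_gradient(args):
    # Prediction errors of one chunk depend only on the current weights,
    # so each chunk can be processed independently.
    X_c, y_c, w, b = args
    margins = y_c * (X_c @ w + b)
    mis = margins <= 0
    grad_w = -np.sum(y_c[mis][:, None] * X_c[mis], axis=0)
    grad_b = -np.sum(y_c[mis])
    return grad_w, grad_b

def batch_gradient_parallel(X, y, w, b, n_chunks=4):
    # Split the training set, compute per-chunk gradients in parallel,
    # then return the accumulated totals for a single model update.
    chunks = list(zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)))
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = list(pool.map(chunk_gradient,
                                [(X_c, y_c, w, b) for X_c, y_c in chunks]))
    grad_w = sum(g_w for g_w, _ in results)
    grad_b = sum(g_b for _, g_b in results)
    return grad_w, grad_b

# Usage (X, y, w, b as in the previous sketch):
# grad_w, grad_b = batch_gradient_parallel(X, y, w, b)
# w -= 0.1 * grad_w
# b -= 0.1 * grad_b
```

Because each chunk only reads the current weights, the chunk computations do not interfere with each other; only the final accumulated update touches the weights.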
Downsides:
- The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
- The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.
- It requires the entire training dataset to be in memory and available to the algorithm.
- Model updates, and in turn training speed, may become very slow for large datasets.