Gradient Descent is an iterative optimization algorithm used in machine learning to minimize the cost function. The cost function tells us how good our model is by comparing its predictions against the known targets using different evaluation metrics. If you need more details on evaluation metrics, visit my post on them, or if you want to try this in Colab or Binder, visit this link.
Gradient Descent is best explained with the classic analogy of a person descending a mountain.
A person is stuck in the mountains and is trying to get down (i.e. trying to find the global minimum). There is heavy fog such that visibility is extremely low. Therefore, the path down the mountain is not visible, so they must use local information to find the minimum. They can use the method of gradient descent, which involves looking at the steepness of the hill at their current position, then proceeding in the direction with the steepest descent (i.e. downhill). If they were trying to find the top of the mountain (i.e. the maximum), then they would proceed in the direction of steepest ascent (i.e. uphill). Using this method, they would eventually find their way down the mountain or possibly get stuck in some hole (i.e. local minimum or saddle point), like a mountain lake. However, assume also that the steepness of the hill is not immediately obvious with simple observation, but rather it requires a sophisticated instrument to measure, which the person happens to have at the moment. It takes quite some time to measure the steepness of the hill with the instrument, thus they should minimize their use of the instrument if they want to get down the mountain before sunset. The difficulty then is choosing the frequency at which they should measure the steepness of the hill so as not to go off track.
In this analogy, the person represents the algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm will explore. The steepness of the hill represents the slope of the error surface at that point. The instrument used to measure steepness is differentiation (the slope of the error surface can be calculated by taking the derivative of the squared error function at that point). The direction they choose to travel in aligns with the gradient of the error surface at that point. The amount of time they travel before taking another measurement is the learning rate of the algorithm. This kind of "hill descending" algorithm has its own limitations, such as getting stuck in a local minimum or a saddle point instead of the global minimum.

[Source: Wikipedia]
Gradient Descent Intuition :-
[Image] Source: Andrew Ng (Coursera Machine Learning Course)
Gradient Descent Algorithm :-
[Image] Source: Andrew Ng (Coursera Machine Learning Course)
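Written out, the update rule shown above (the standard one from that course) is:

    repeat until convergence {
        θ_j := θ_j - α * ∂/∂θ_j J(θ_0, θ_1)      (update θ_0 and θ_1 simultaneously)
    }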
Here alpha is the learning rate.
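To make this concrete, here is a minimal Python sketch of the same loop (the names gradient_descent, grad_J and theta_init are just illustrative, not from the course material):

import numpy as np

def gradient_descent(grad_J, theta_init, alpha=0.1, n_steps=1000, tol=1e-8):
    # grad_J: function that returns the gradient of the cost J at theta
    # theta_init: starting parameter values
    # alpha: the learning rate, i.e. the size of each downhill step
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(n_steps):
        grad = grad_J(theta)              # measure the slope of the error surface
        theta = theta - alpha * grad      # step in the direction of steepest descent
        if np.linalg.norm(grad) < tol:    # stop once the surface is (almost) flat
            break
    return theta

All parameters are updated together with the same rule, just as in the formula above.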
If alpha is too small, gradient descent can be very slow.
If alpha is too large, gradient descent can overshoot the minimum. It may fail to converge, or it may even diverge.
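As a toy illustration (not from the original post), take J(θ) = θ², whose gradient is 2θ, and reuse the gradient_descent sketch above:

grad_J = lambda theta: 2 * theta   # gradient of J(theta) = theta**2

print(gradient_descent(grad_J, theta_init=5.0, alpha=0.1))              # converges towards the minimum at 0
print(gradient_descent(grad_J, theta_init=5.0, alpha=1.1, n_steps=50))  # overshoots every step; |theta| keeps growing

Choosing alpha is therefore a trade-off: small enough that the steps do not overshoot, yet large enough that convergence does not take forever.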