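Batch Gradient Descent: Batch Gradient Descent computes over the complete training set at every step, so it is very slow on very large training data, and running Batch GD is therefore computationally expensive. However, it works well on convex or relatively smooth error manifolds. Batch GD also scales well with the number of features.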
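To make the "full training set at every step" point concrete, here is a minimal sketch of Batch GD, assuming a one-feature linear model fitted with mean squared error. The function name `batch_gradient_descent`, the toy data, and the learning rate are illustrative assumptions, not part of the original text.

```python
def batch_gradient_descent(xs, ys, lr=0.05, n_steps=5000):
    """Fit y ~ w * x + b by minimizing mean squared error with Batch GD."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_w = 0.0
        grad_b = 0.0
        # Every single step loops over the *entire* training set.
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # One parameter update per full pass over the data.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy, noise-free data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = batch_gradient_descent(xs, ys)
print(w, b)
```

The cost of one update grows linearly with the number of training instances, which is why this approach becomes slow on very large data sets.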
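Stochastic Gradient Descent: SGD tries to solve the main problem of Batch Gradient Descent, namely that it uses the whole training data to compute the gradient at every step. SGD is stochastic in nature: at each step it picks one "random" instance of the training data and computes the gradient on that instance alone. Unlike Batch GD, it processes far less data per step, which makes each step much faster.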
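The single-instance update can be sketched as follows, again assuming a one-feature linear model with squared error. The function name `stochastic_gradient_descent`, the fixed seed, and the toy data are assumptions made for illustration.

```python
import random


def stochastic_gradient_descent(xs, ys, lr=0.02, n_steps=5000, seed=0):
    """Fit y ~ w * x + b with SGD: one random training instance per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        i = rng.randrange(n)                 # pick one random instance
        error = (w * xs[i] + b) - ys[i]      # error on that instance only
        # Gradient of the squared error for this single instance.
        w -= lr * 2.0 * error * xs[i]
        b -= lr * 2.0 * error
    return w, b


# Toy, noise-free data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = stochastic_gradient_descent(xs, ys)
print(w, b)
```

Each update here costs the same regardless of how large the training set is. Because this toy data contains no noise, every instance agrees on the same optimum; on noisy data the iterates would keep fluctuating around it.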
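The stochastic nature of SGD has a downside: once it gets close to the minimum, it does not settle down but keeps bouncing around it. This yields good values for the model parameters, but not optimal ones. The problem can be addressed by reducing the learning rate at each step, which damps the bouncing, so that SGD may settle at the global minimum after some time.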
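The effect of shrinking the learning rate can be illustrated on a deliberately simple one-parameter problem whose optimum is known exactly. This is a sketch under stated assumptions: the loss (average of 0.5 * (theta - x)^2, minimized by the mean of the data), the inverse-time decay schedule, and all names and constants below are chosen for illustration and do not come from the original text.

```python
import random


def sgd_estimate_mean(data, n_steps, lr_schedule, seed=0):
    """Minimize the average of 0.5 * (theta - x)^2 over data with SGD.

    The exact minimizer is the mean of the data, so it is easy to see
    how far the final iterate ends up from the optimum.
    """
    rng = random.Random(seed)
    theta = 0.0
    for t in range(n_steps):
        x = rng.choice(data)              # one random training instance
        grad = theta - x                  # gradient of 0.5 * (theta - x)^2
        theta -= lr_schedule(t) * grad
    return theta


def constant_lr(t, lr0=0.5):
    # Same step size forever: the iterate keeps bouncing near the minimum.
    return lr0


def decaying_lr(t, lr0=0.5, decay=0.01):
    # Inverse-time decay: the step size shrinks as training progresses.
    return lr0 / (1.0 + decay * t)


data = [1.0, 2.0, 3.0, 4.0, 5.0]          # the optimum (mean) is 3.0
theta_const = sgd_estimate_mean(data, 20000, constant_lr)
theta_decay = sgd_estimate_mean(data, 20000, decaying_lr)
print("constant lr:", theta_const)
print("decaying lr:", theta_decay)
```

With the constant rate, the final value typically keeps jittering around 3.0, whereas the decaying schedule tends to end much closer to it. The schedule has to shrink slowly enough that SGD still reaches the neighborhood of the minimum before the steps become tiny.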
S.NO. | Batch Gradient Descent | Stochastic Gradient Descent |
1. | Computes the gradient using the whole training set | Computes the gradient using a single training sample |
2. | Slow and computationally expensive algorithm | Faster and less computationally expensive than Batch GD |
3. | Not recommended for very large training sets. | Can be used for large training sets. |
4. | Deterministic in nature. | Stochastic in nature. |
5. | Converges to the optimal solution on convex problems, given sufficient time. | Gives a good solution, but not the optimal one unless the learning rate is reduced over time. |
6. | No random shuffling of points is required. | The samples should be visited in random order, which is why the training set is shuffled at every epoch. |
7. | Can’t escape shallow local minima easily. | SGD can escape shallow local minima more easily. |
8. | Convergence is slow. | Reaches convergence much faster. |