Optimizers in TensorFlow

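An optimizer is a technique or algorithm that reduces the loss (error) by adjusting parameters and weights, thereby minimizing the loss function and yielding better model accuracy more quickly.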

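Optimizer is the base class that TensorFlow's optimizers extend. It is initialized with the parameters of the model, but no tensor is given to it. The basic optimizer class provided by TensorFlow is:

tf.train.Optimizer - TensorFlow version 1.x
tf.compat.v1.train.Optimizer - TensorFlow version 2.x

This class is never used directly; instead, its subclasses are instantiated.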

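Gradient Descent Algorithm

Before looking at the individual optimizers, it helps to understand the algorithm on which the others are built: gradient descent. Gradient descent links the weights to the loss function. A gradient is a measure of change, and the gradient descent algorithm uses the partial derivatives of the loss with respect to each weight to decide how to adjust that weight to reduce the loss, for example by adding 0.7 or subtracting 0.27. A difficulty arises with large, high-dimensional problems, where the algorithm can get stuck in a local minimum instead of reaching the global minimum.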

Syntax: tf.compat.v1.train.GradientDescentOptimizer(learning_rate,
                                                    use_locking=False,
                                                    name='GradientDescent')
Parameters: 
learning_rate: rate at which algorithm updates the parameter. 
               Tensor or float type of value. 
use_locking: Use locks for update operations if True
name: Optional name for the operation
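
To make the idea concrete, here is a minimal sketch of plain gradient descent written in pure Python rather than TensorFlow. The quadratic loss, its derivative, and the function name gradient_descent are assumptions chosen for illustration; they are not part of the TensorFlow API.

```python
def gradient_descent(grad_fn, w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient and return the final weight."""
    for _ in range(steps):
        # Move the weight a small distance in the direction that lowers the loss.
        w = w - learning_rate * grad_fn(w)
    return w


# Hypothetical loss L(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # very close to 3.0, the minimizer of this loss
```

Because this loss has a single minimum, the local minimum problem described above does not show up here; it only appears on non-convex losses.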

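TensorFlow Keras Optimizer Classes

TensorFlow mainly supports the following nine optimizers, all of which derive from a base Optimizer class:

Gradient Descent
SGD
AdaGrad
RMSprop
Adadelta
Adam
AdaMax
NAdam
FTRL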

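SGD Optimizer (Stochastic Gradient Descent)

The stochastic gradient descent (SGD) optimization method performs a parameter update for every training example. Unlike batch gradient descent, which recomputes gradients over the whole dataset before each update, SGD updates frequently. On large datasets these frequent updates have high variance, which causes the objective function to fluctuate heavily.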

Syntax: tf.keras.optimizers.SGD(learning_rate = 0.01,
                                momentum=0.0, 
                                nesterov=False, 
                                name='SGD', 
                                **kwargs)
Parameters: 
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.01
momentum: accelerates gradient descent in the relevant
          direction. Float type of value. Default value is 0.0
nesterov: Whether or not to apply Nesterov momentum.
          Boolean type of value. Default value is False.
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
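
The way learning_rate, momentum, and nesterov interact can be sketched in a few lines of pure Python. This is an illustrative scalar version of the commonly documented update rule, not TensorFlow's implementation; the helper name sgd_step and the toy data are assumptions.

```python
def sgd_step(w, g, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """One SGD update for a scalar weight w with gradient g."""
    velocity = momentum * velocity - learning_rate * g
    if nesterov:
        # Nesterov momentum: look ahead along the velocity, then apply the gradient step.
        w = w + momentum * velocity - learning_rate * g
    else:
        w = w + velocity
    return w, velocity


# Hypothetical data generated by y = 2 * x; the model is y_hat = w * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, velocity = 0.0, 0.0
for _ in range(50):                      # epochs
    for x, y in data:                    # one update per training example
        g = 2.0 * (w * x - y) * x        # derivative of (w * x - y)^2 with respect to w
        w, velocity = sgd_step(w, g, velocity)
print(w)  # approaches 2.0
```

With momentum left at its default of 0.0, each update is simply the negative gradient scaled by the learning rate; a nonzero momentum carries part of the previous update forward.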

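Advantages:

Requires less memory.
Updates the model parameters frequently.
If momentum is used, it helps to reduce noise.

Disadvantages:

High variance
Computationally expensive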

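AdaGrad Optimizer

AdaGrad stands for Adaptive Gradient Algorithm. The AdaGrad optimizer adapts the learning rate to individual features, so some weights in the model may have different learning rates from others.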

Syntax: tf.keras.optimizers.Adagrad(learning_rate=0.001,
                                     initial_accumulator_value=0.1,
                                     epsilon=1e-07,
                                     name="Adagrad",
                                     **kwargs)
Parameters: 
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
initial_accumulator_value: Starting value for the per-parameter
                           accumulators. Floating point type of value.
                           Must be non-negative. Default value is 0.1
epsilon: Small value used to maintain numerical stability.
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
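
The role of the accumulator can be illustrated with a small pure-Python sketch of the textbook AdaGrad rule. The helper name adagrad_step is an assumption, and the exact placement of epsilon may differ from TensorFlow's implementation.

```python
import math


def adagrad_step(w, g, accumulator, learning_rate=0.001, epsilon=1e-07):
    """One AdaGrad update for a scalar weight w with gradient g."""
    accumulator = accumulator + g * g    # running sum of squared gradients
    w = w - learning_rate * g / (math.sqrt(accumulator) + epsilon)
    return w, accumulator


# Feed the same gradient three times and record how far the weight moves each time.
w, acc = 0.0, 0.1                        # 0.1 mirrors initial_accumulator_value
step_sizes = []
for _ in range(3):
    new_w, acc = adagrad_step(w, 1.0, acc, learning_rate=0.1)
    step_sizes.append(w - new_w)
    w = new_w
print(step_sizes)  # each step is smaller than the one before
```

A weight that rarely receives a gradient keeps a small accumulator and therefore a larger effective learning rate, which is why the method suits sparse features.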

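Advantages:

Best suited to sparse datasets
The learning rate is updated as the iterations proceed

Disadvantages:

The learning rate keeps shrinking as training proceeds, because the squared gradients accumulate
May lead to the dead neuron problem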

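RMSprop Optimizer

RMSprop stands for Root Mean Square Propagation. Instead of letting squared gradients accumulate over the whole of training, as AdaGrad does, RMSprop accumulates them only over a recent window by keeping an exponentially decaying average. It can be considered an updated version of AdaGrad with a few improvements. RMSprop uses plain momentum rather than Nesterov momentum.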

Syntax: tf.keras.optimizers.RMSprop(learning_rate=0.001, 
                                    rho=0.9, 
                                    momentum=0.0, 
                                    epsilon=1e-07, 
                                    centered=False,
                                    name='RMSprop', 
                                    **kwargs)
Parameters:
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
rho: Discounting factor for gradients. Default value is 0.9
momentum: accelerates rmsprop in the relevant direction.
          Float type of value. Default value is 0.0
epsilon: Small value used to maintain numerical stability.
         Floating point type of value. Default value is 1e-07
centered: If True, gradients are normalized by the estimated variance
          of the gradient. Boolean type of value. Setting it to True may
          help with training, but it is computationally more
          expensive. Default value is False.
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
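
A minimal pure-Python sketch of the basic RMSprop rule (without the momentum and centered options) shows the effect of rho. The helper name rmsprop_step is an assumption, and details such as where epsilon is added may differ from TensorFlow's implementation.

```python
import math


def rmsprop_step(w, g, avg_sq, learning_rate=0.001, rho=0.9, epsilon=1e-07):
    """One basic RMSprop update for a scalar weight w with gradient g."""
    # Exponentially decaying average of squared gradients: old values fade out.
    avg_sq = rho * avg_sq + (1.0 - rho) * g * g
    w = w - learning_rate * g / (math.sqrt(avg_sq) + epsilon)
    return w, avg_sq


# With a constant gradient the average settles, so the step size does not vanish.
w, avg_sq = 0.0, 0.0
for _ in range(200):
    previous_w = w
    w, avg_sq = rmsprop_step(w, 1.0, avg_sq)
last_step = previous_w - w
print(avg_sq, last_step)  # avg_sq settles near 1.0, last_step near learning_rate
```

This is the main contrast with AdaGrad, whose accumulator only grows and therefore keeps shrinking the step.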

Advantages:

The learning rate is adjusted automatically.
A separate learning rate for each parameter

Disadvantage: Slow learning

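Adadelta Optimizer

The Adaptive Delta (Adadelta) optimizer is an extension of AdaGrad (similar to the RMSprop optimizer). However, Adadelta drops the hand-set learning rate and replaces it with an exponential moving average of squared deltas (the differences between the current and the updated weights). It also tries to eliminate the decaying learning rate problem.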

Syntax: tf.keras.optimizers.Adadelta(learning_rate=0.001, 
                                     rho=0.95, 
                                     epsilon=1e-07, 
                                     name='Adadelta',
                                     **kwargs)
Parameters:
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
rho: Decay rate. Tensor or floating point type of value.
     Default value is 0.95
epsilon: Small value used to maintain numerical stability.
         Floating point type of value. Default value is 1e-07
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
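
The two running averages that Adadelta maintains can be sketched in pure Python as follows. This follows the textbook formulation, in which learning_rate is 1.0 (an assumption made here; the signature above defaults to 0.001 and scales the same update). The helper name adadelta_step and the toy loss are illustrative only.

```python
import math


def adadelta_step(w, g, avg_sq_grad, avg_sq_delta,
                  learning_rate=1.0, rho=0.95, epsilon=1e-07):
    """One Adadelta update for a scalar weight w with gradient g."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
    # The ratio of the two running RMS values takes the place of a hand-tuned step size.
    delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    w = w + learning_rate * delta
    return w, avg_sq_grad, avg_sq_delta


# Hypothetical loss L(w) = w^2 with derivative 2 * w, starting from w = 1.0.
w, avg_g, avg_d = 1.0, 0.0, 0.0
for _ in range(100):
    w, avg_g, avg_d = adadelta_step(w, 2.0 * w, avg_g, avg_d)
print(w)  # has moved from 1.0 toward 0.0, the minimizer
```

Early steps are small because both averages start at zero and only epsilon keeps the numerator from vanishing.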

Advantage: A learning rate does not have to be chosen by hand.

Disadvantage: Computationally expensive

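Adam Optimizer

Adaptive Moment Estimation (Adam) is one of the most widely used optimization techniques today. In this method, an adaptive learning rate is computed for each parameter. It combines the advantages of RMSprop and momentum: it stores a decaying average of past gradients as well as a decaying average of past squared gradients.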

Syntax: tf.keras.optimizers.Adam(learning_rate=0.001,
                                 beta_1=0.9, 
                                 beta_2=0.999, 
                                 epsilon=1e-07, 
                                 amsgrad=False,
                                 name='Adam', 
                                 **kwargs)
Parameters:
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
beta_1: Exponential decay rate for 1st moment. Constant float
        tensor or float type of value. Default value is 0.9
beta_2: Exponential decay rate for 2nd moment. Constant float
        tensor or float type of value. Default value is 0.999
epsilon: Small value used to maintain numerical stability.
         Floating point type of value. Default value is 1e-07
amsgrad: Whether to use the AMSGrad variant or not.
         Default value is False.
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
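
How beta_1, beta_2, and epsilon enter the update can be sketched in pure Python. This is the standard Adam rule for a single scalar weight, without the amsgrad option; the helper name adam_step is an assumption and the code is not TensorFlow's implementation.

```python
import math


def adam_step(w, g, m, v, t,
              learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * g          # decaying average of gradients
    v = beta_2 * v + (1.0 - beta_2) * g * g      # decaying average of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)              # bias correction for the zero start
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


# On the first step the update size is close to learning_rate, whatever the gradient scale.
w_small, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
w_large, _, _ = adam_step(1.0, 50.0, 0.0, 0.0, t=1)
print(1.0 - w_small, 1.0 - w_large)  # both close to 0.001
```

Dividing by the square root of the second moment is what makes the step size largely independent of the gradient's magnitude.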

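Advantages:

Easy to implement
Requires less memory
Computationally efficient

Disadvantages:

Can have a weight decay problem
May sometimes fail to converge to an optimal solution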

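AdaMax Optimizer

AdaMax is a variant of the Adam optimizer. It is built on an adaptive approximation of low-order moments (based on the infinity norm). In some cases, such as models with embeddings, AdaMax is considered better than Adam.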

Syntax: tf.keras.optimizers.Adamax(learning_rate=0.001, 
                                   beta_1=0.9, 
                                   beta_2=0.999, 
                                   epsilon=1e-07,
                                   name='Adamax', 
                                   **kwargs)
Parameters:
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
beta_1: Exponential decay rate for 1st moment. Constant float
        tensor or float type of value. Default value is 0.9
beta_2: Exponential decay rate for weighted infinity norm.
        Constant float tensor or float type of value.
        Default value is 0.999
epsilon: Small value used to maintain numerical stability.
         Floating point type of value. Default value is 1e-07
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
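
The difference from Adam is easiest to see in code: the second-moment average is replaced by a running maximum of gradient magnitudes. The following pure-Python sketch of the commonly documented AdaMax rule uses a hypothetical helper name, adamax_step, and is not TensorFlow's implementation.

```python
def adamax_step(w, g, m, u, t,
                learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    """One AdaMax update for a scalar weight; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * g          # decaying average of gradients
    u = max(beta_2 * u, abs(g))                  # exponentially weighted infinity norm
    w = w - (learning_rate / (1.0 - beta_1 ** t)) * m / (u + epsilon)
    return w, m, u


w, m, u = 1.0, 0.0, 0.0
w, m, u = adamax_step(w, 0.5, m, u, t=1)         # u becomes 0.5
u_after_first = u
w, m, u = adamax_step(w, 0.1, m, u, t=2)         # smaller gradient: u only decays slightly
print(u_after_first, u)
```

Because u is a maximum rather than an average, a single large gradient keeps later steps small for a long time, which is one reason the method is regarded as stable.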

Advantages:

The infinity norm makes the algorithm stable.
Requires less hyperparameter tuning

Disadvantage: Generalization issues

NAdam Optimizer

NAdam is short for Nesterov and Adam optimizer. NAdam updates the weights using Nesterov momentum instead of the plain momentum used by Adam.

Syntax: tf.keras.optimizers.Nadam(learning_rate=0.001, 
                                  beta_1=0.9, 
                                  beta_2=0.999, 
                                  epsilon=1e-07,
                                  name='Nadam', 
                                  **kwargs)
Parameters:
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
beta_1: Exponential decay rate for 1st moment. Constant float
        tensor or float type of value. Default value is 0.9
beta_2: Exponential decay rate for 2nd moment.
        Constant float tensor or float type of value.
        Default value is 0.999
epsilon: Small value used to maintain numerical stability.
         Floating point type of value. Default value is 1e-07
name: Optional name for the operation
**kwargs: Variable-length keyword arguments.
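
A simplified pure-Python sketch makes the Nesterov look-ahead visible. It follows a common textbook form of NAdam and is an assumption rather than TensorFlow's code; real implementations may differ, for example by applying a momentum schedule. The helper name nadam_step is hypothetical.

```python
import math


def nadam_step(w, g, m, v, t,
               learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    """One simplified NAdam update for a scalar weight; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * g
    v = beta_2 * v + (1.0 - beta_2) * g * g
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    # Nesterov look-ahead: blend the corrected momentum with the current gradient.
    look_ahead = beta_1 * m_hat + (1.0 - beta_1) * g / (1.0 - beta_1 ** t)
    w = w - learning_rate * look_ahead / (math.sqrt(v_hat) + epsilon)
    return w, m, v


w_next, m_next, v_next = nadam_step(1.0, 0.5, 0.0, 0.0, t=1)
print(w_next)
```

The only change relative to the Adam rule is the look_ahead term, which gives the current gradient extra weight in the step.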

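Advantages:

Gives better results when the loss has high curvature or the gradients are noisy.
Learns faster

Disadvantage: May sometimes fail to converge to an optimal solution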

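FTRL Optimizer

Follow The Regularized Leader (FTRL) is an optimization algorithm best suited to shallow models with large, sparse feature spaces. This version supports both shrinkage-type L2 regularization (the sum of the L2 penalty and the loss function) and online L2 regularization.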

Syntax: tf.keras.optimizers.Ftrl(learning_rate=0.001, 
                                 learning_rate_power=-0.5, 
                                 initial_accumulator_value=0.1,
                                 l1_regularization_strength=0.0, 
                                 l2_regularization_strength=0.0,
                                 name='Ftrl', 
                                 l2_shrinkage_regularization_strength=0.0,
                                 beta=0.0,
                                 **kwargs)
Parameters:
learning_rate: rate at which algorithm updates the parameter.
               Tensor or float type of value. Default value is 0.001
learning_rate_power: Controls the drop in learning rate during
                     training. Float type of value. Should be less
                     than or equal to 0. Default value is -0.5.
initial_accumulator_value: Initial value for accumulator. Value
                           should be greater than or equal to zero.
                           Default value is 0.1.
l1_regularization_strength: Stabilization penalty.
                            Only positive values or 0 are allowed.
                            Float type of value. Default value is 0.0
l2_regularization_strength: Stabilization penalty.
                            Only positive values or 0 are allowed.
                            Float type of value. Default value is 0.0
name: Optional name for the operation
l2_shrinkage_regularization_strength: Magnitude penalty.
                           Only positive values or 0 are allowed.
                           Float type of value. Default value is 0.0
beta: Float type of value. Default value is 0.0
**kwargs: Variable-length keyword arguments.
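
As a rough illustration of how the regularization strengths shape the weights, here is a pure-Python sketch of the per-coordinate FTRL-Proximal rule for learning_rate_power = -0.5. It is an assumption-laden sketch, not TensorFlow's kernel: the helper name ftrl_step and the short argument names l1 and l2 are hypothetical, and the shrinkage L2 term is left out.

```python
import math


def ftrl_step(w, g, z, n, learning_rate=0.001, l1=0.0, l2=0.0, beta=0.0):
    """One FTRL-Proximal update for a scalar weight w with gradient g."""
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / learning_rate
    z = z + g - sigma * w                # adjusted running sum of gradients
    n = n + g * g                        # running sum of squared gradients
    if abs(z) <= l1:
        w = 0.0                          # L1 keeps weakly supported weights at exactly zero
    else:
        sign = 1.0 if z > 0 else -1.0
        w = -(z - sign * l1) / ((beta + math.sqrt(n)) / learning_rate + l2)
    return w, z, n


# Same gradient, with and without L1 regularization.
w_plain, _, _ = ftrl_step(0.0, 0.5, 0.0, 0.0, learning_rate=0.1)
w_sparse, _, _ = ftrl_step(0.0, 0.5, 0.0, 0.0, learning_rate=0.1, l1=1.0)
print(w_plain, w_sparse)  # the L1-regularized weight stays at 0.0
```

The exact zeros produced by the L1 term are what make the method attractive for large, sparse feature spaces.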

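Advantage: Can minimize the loss function better.

Disadvantages:

If the range of the regularizer is too small, adequate stability cannot be achieved.
If the range of the regularizer is too large, the result ends up far from the optimal decision.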