Feature scaling is one of the most important data preprocessing steps in machine learning. If the data is not scaled, algorithms that compute distances between features become biased toward features with larger numeric values.
Tree-based algorithms are fairly insensitive to the scale of the features. Beyond that, feature scaling helps machine learning and deep learning algorithms train and converge faster.
There are several feature scaling techniques; Normalization and Standardization are the most popular of them, and also the most confusing.
Let's clear up that confusion.
Normalization, or Min-Max Scaling, is used to transform features onto a similar scale. The new value is computed as:
X_new = (X - X_min)/(X_max - X_min)
This scales the range to [0, 1], or sometimes [-1, 1]. Geometrically, the transformation squishes the n-dimensional data into an n-dimensional unit hypercube. Normalization is useful when there are no outliers, because it cannot handle them. For example, we would typically normalize age rather than income, since only a few people have very high incomes while ages are spread close to uniformly.
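As a minimal sketch (the toy data below is made up for illustration), Min-Max Scaling can be applied either by hand with the formula above or with Scikit-Learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: an "age" column and an "income" column (made-up values).
X = np.array([[25,  30_000],
              [32,  45_000],
              [47,  52_000],
              [51, 250_000]], dtype=float)

scaler = MinMaxScaler()              # defaults to feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)   # each column now lies in [0, 1]

# The same transform written out manually with the formula above.
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
assert np.allclose(X_scaled, X_manual)
```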
Standardization, or Z-Score Normalization, transforms a feature by subtracting the mean and dividing by the standard deviation. The result is commonly called the Z-score:
X_new = (X - mean)/std
Standardization can be helpful when the data follows a Gaussian distribution, although that is not strictly required. Geometrically, it translates the data so that the mean vector of the original data moves to the origin, then squishes or expands the points so that the standard deviation along each dimension becomes 1. Note that we are only changing the mean and the standard deviation: the shape of the distribution is unaffected, so a normal distribution simply becomes the standard normal distribution.
Standardization is much less affected by outliers, because there is no predefined range that the transformed features must fit into.
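A minimal sketch of standardization, again with made-up data, using Scikit-Learn's StandardScaler (which scales by the population standard deviation, matching NumPy's default):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The same made-up "age" and "income" columns as before.
X = np.array([[25,  30_000],
              [32,  45_000],
              [47,  52_000],
              [51, 250_000]], dtype=float)

X_std = StandardScaler().fit_transform(X)

# Each column now has (approximately) zero mean and unit variance.
print(X_std.mean(axis=0))   # ~[0, 0]
print(X_std.std(axis=0))    # ~[1, 1]

# Equivalent manual computation with the Z-score formula above.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(X_std, X_manual)
```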
Differences between Normalization and Standardization
S.NO. | Normalization | Standardization |
---|---|---|
1. | Minimum and maximum values of the features are used for scaling. | Mean and standard deviation are used for scaling. |
2. | It is used when features are on different scales. | It is used when we want to ensure zero mean and unit standard deviation. |
3. | It scales values to [0, 1] or sometimes [-1, 1]. | It is not bounded to a certain range. |
4. | It is strongly affected by outliers. | It is much less affected by outliers. |
5. | Scikit-Learn provides a transformer called MinMaxScaler for normalization. | Scikit-Learn provides a transformer called StandardScaler for standardization. |
6. | This transformation squishes the n-dimensional data into an n-dimensional unit hypercube. | It translates the mean vector of the original data to the origin and squishes or expands the points to unit standard deviation. |
7. | It is useful when we don't know the distribution of the data. | It is useful when the feature distribution is Normal (Gaussian). |
8. | It is often called Min-Max Scaling. | It is often called Z-Score Normalization. |
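To make row 4 concrete, here is a small sketch (with made-up numbers) of how each scaler reacts to a single outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with four ordinary values and one large outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(MinMaxScaler().fit_transform(x).ravel())
# -> [0.     0.0101 0.0202 0.0303 1.    ]  the inliers are crushed near 0

print(StandardScaler().fit_transform(x).ravel())
# -> roughly [-0.54 -0.51 -0.49 -0.46  2.00]  the inliers stay distinguishable
```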