Scikit-Learn - Extended Linear Modeling

Last modified: 2020-12-10 05:49:09             Author: Mango


This chapter focuses on polynomial features and the Pipeline tool in Sklearn.

Introduction to Polynomial Features

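Linear models trained on non-linear functions of the data generally keep the fast performance of linear methods while being able to fit a much wider range of data. This is why such models are commonly used in machine learning.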

For example, a simple linear regression can be extended by constructing polynomial features from the input features.

Mathematically, the standard linear regression model for two-dimensional data looks like this:

$$ Y = W_0 + W_1 X_1 + W_2 X_2 $$

Now we can combine the features into second-order polynomials, and the model becomes:

$$ Y = W_0 + W_1 X_1 + W_2 X_2 + W_3 X_1 X_2 + W_4 X_1^2 + W_5 X_2^2 $$

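This is still a linear model: it is linear in the weights W, even though it is non-linear in the inputs X_1 and X_2. The resulting polynomial regression therefore belongs to the same class of linear models and can be solved with the same techniques.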
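The point that the second-order model is solved like any other linear model can be sketched with plain NumPy. This is only an illustration under stated assumptions: the weights in `w_true` and the random sample are hypothetical values, not taken from the text. The columns of the expanded matrix follow the term order of the second-order equation above.

```python
import numpy as np

# Hypothetical sample of 50 two-dimensional points (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
x1, x2 = X[:, 0], X[:, 1]

# Hypothetical "true" weights W_0 .. W_5, chosen only for this sketch.
w_true = np.array([1.0, 2.0, -3.0, 0.5, 4.0, -1.5])

# Expanded design matrix: 1, X_1, X_2, X_1*X_2, X_1^2, X_2^2.
Phi = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
y = Phi @ w_true

# Ordinary least squares on the expanded columns: the problem is linear in W.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w_hat, 6))
```

Because the targets here are noise-free, the least-squares solution should match `w_true` up to floating-point error.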

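To build such features automatically, scikit-learn provides the PolynomialFeatures transformer in sklearn.preprocessing. It transforms an input data matrix into a new data matrix containing the polynomial features up to a given degree.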

Parameters

The following table lists the parameters used by PolynomialFeatures:

| Sr.No | Parameter | Description |
|---|---|---|
| 1 | degree − integer, default = 2 | The degree of the polynomial features. |
| 2 | interaction_only − Boolean, default = False | If set to True, only interaction features are produced, i.e. features that are products of at most degree distinct input features. |
| 3 | include_bias − Boolean, default = True | If True, a bias column is included, i.e. the feature in which all polynomial powers are zero (a column of ones). |
| 4 | order − str in {'C', 'F'}, default = 'C' | The order of the output array in the dense case. 'F' order is faster to compute, but it may slow down subsequent estimators. |
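As a minimal sketch of how these parameters change the result, the following compares the output for a small input matrix. The 3 x 2 input array is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical input: 3 samples, 2 features -> [[0, 1], [2, 3], [4, 5]].
X = np.arange(6).reshape(3, 2)

# Default settings: bias, both inputs, both squares and the cross term.
full = PolynomialFeatures(degree=2).fit_transform(X)

# interaction_only=True drops the pure powers and keeps only the cross term.
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)

# include_bias=False drops the leading column of ones.
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(full.shape, inter.shape, no_bias.shape)
```

For the row [2, 3], the interaction-only output contains the bias, the two inputs and their product, with no squared terms.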

Attributes

The following table lists the attributes of PolynomialFeatures:

| Sr.No | Attribute | Description |
|---|---|---|
| 1 | powers_ − array, shape (n_output_features, n_input_features) | powers_[i, j] is the exponent of the jth input in the ith output. |
| 2 | n_input_features_ − int | The total number of input features. Newer scikit-learn releases expose this as n_features_in_. |
| 3 | n_output_features_ − int | The total number of polynomial output features. |
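The meaning of powers_ can be checked with a short sketch that rebuilds the transformed matrix by hand from the exponents. The input array is the same 4 x 2 example used elsewhere in this chapter; the names `manual` and `transformed` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(8).reshape(4, 2)
poly = PolynomialFeatures(degree=2).fit(X)

# One row per output feature, one column per input feature.
print(poly.powers_)
print(poly.n_output_features_)

# Rebuild each output column i as the product over inputs j of
# X[:, j] ** powers_[i, j], using broadcasting.
Xf = X.astype(float)
manual = np.prod(Xf[:, np.newaxis, :] ** poly.powers_[np.newaxis, :, :], axis=2)
transformed = poly.transform(X)
```

If the attribute behaves as described in the table, `manual` and `transformed` should agree.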

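Implementation Example

The following Python script uses the PolynomialFeatures transformer to expand an array of 8 values, reshaped to (4, 2), into degree-2 polynomial features of shape (4, 6):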

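from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Build a (4, 2) input matrix from the integers 0..7.
X = np.arange(8).reshape(4, 2)
# Expand each row [a, b] into [1, a, b, a^2, a*b, b^2].
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)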

Output

array(
   [
      [ 1., 0., 1., 0., 0., 1.],
      [ 1., 2., 3., 4., 6., 9.],
      [ 1., 4., 5., 16., 20., 25.],
      [ 1., 6., 7., 36., 42., 49.]
   ]
)

Streamlining with the Pipeline Tool

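This kind of preprocessing, i.e. transforming an input data matrix into a new data matrix of a given degree, can be streamlined with the Pipeline tool, which chains multiple estimators into a single one.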

The following Python script uses scikit-learn's Pipeline tool to streamline the preprocessing and fit data generated from a third-order polynomial.

# First, import the necessary packages.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

# Next, create a Pipeline that chains the polynomial expansion and the regression.
# fit_intercept=False because PolynomialFeatures already adds the bias column.
Stream_model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('linear', LinearRegression(fit_intercept=False))])

# Generate sample data from a known cubic polynomial and fit the model.
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
Stream_model = Stream_model.fit(x[:, np.newaxis], y)

# Inspect the fitted coefficients of the linear step.
Stream_model.named_steps['linear'].coef_

Output

array([ 3., -2., 1., -1.])

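The output above shows that a linear model trained on polynomial features recovers the exact coefficients of the input polynomial (3, -2, 1 and -1, in increasing order of power).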
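Once fitted, the pipeline can be used like a single estimator. The sketch below rebuilds the same pipeline and calls predict on two new inputs; the values in `x_new` and the variable name `model` are assumptions made for illustration. The pipeline applies the polynomial expansion to the new inputs before the linear step, so no manual preprocessing is needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same two-step pipeline as in the example above.
model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("linear", LinearRegression(fit_intercept=False)),
])

# Training data generated from the cubic 3 - 2x + x^2 - x^3.
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
model.fit(x[:, np.newaxis], y)

# Hypothetical new inputs outside the training range.
x_new = np.array([5.0, 6.0])
y_pred = model.predict(x_new[:, np.newaxis])

# Values of the same cubic, for comparison.
y_expected = 3 - 2 * x_new + x_new ** 2 - x_new ** 3
print(y_pred, y_expected)
```

Since the training data is noise-free and the model has the right degree, the predictions should match the cubic up to floating-point error. With noisy data or a mismatched degree, extrapolation like this would be far less reliable.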