LightGBM (Light Gradient Boosting Machine)

Last modified: 2021-04-16 08:58:04, Author: Mango

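LightGBM is a gradient boosting framework based on decision trees that improves model efficiency and reduces memory usage.
It uses two novel techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address the limitations of the histogram-based algorithm used in most GBDT (Gradient Boosting Decision Tree) frameworks. GOSS and EFB, both described below, are the defining characteristics of the LightGBM algorithm. Together they make the model work efficiently and give it an edge over other GBDT frameworks.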

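Gradient-based One-Side Sampling technique for LightGBM:
Different data instances play different roles in the computation of information gain. Instances with larger gradients (i.e., under-trained instances) contribute more to the information gain. GOSS keeps the instances with large gradients (e.g., larger than a predefined threshold, or among the top percentiles) and randomly drops only instances with small gradients, so that the accuracy of the information gain estimate is retained. Compared with uniform random sampling at the same target sampling rate, this leads to a more accurate gain estimate, especially when the value of the information gain has a large range.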

Algorithm for GOSS:

Input: I: training data, d: iterations
Input: a: sampling ratio of large gradient data
Input: b: sampling ratio of small gradient data
Input: loss: loss function, L: weak learner
models ← {}, fact ← (1-a)/b
topN ← a × len(I), randN ← b × len(I)
for i = 1 to d do
    preds ← models.predict(I)
    g ← loss(I, preds), w ← {1, 1, ...}
    sorted ← GetSortedIndices(abs(g))
    topSet ← sorted[1:topN]
    randSet ← RandomPick(sorted[topN:len(I)], randN)
    usedSet ← topSet + randSet
    w[randSet] ×= fact    // Assign weight fact to the small gradient data.
    newModel ← L(I[usedSet], g[usedSet], w[usedSet])
    models.append(newModel)
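
The sampling step inside the loop can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LightGBM's actual implementation: the function name `goss_sample`, the `seed` argument, and the rounding of `a × len(I)` and `b × len(I)` to whole counts are choices made for this example only.

```python
import random


def goss_sample(gradients, a, b, seed=0):
    """Return (used_indices, weights) for one GOSS sampling step.

    a: fraction of instances kept because their |gradient| is largest
    b: fraction of instances drawn at random from the remaining ones
    """
    n = len(gradients)
    top_n = int(round(a * n))   # assumption: round to a whole count
    rand_n = int(round(b * n))
    fact = (1.0 - a) / b        # weight given to the small-gradient sample

    # Indices ordered by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_set = order[:top_n]
    rand_set = random.Random(seed).sample(order[top_n:], rand_n)

    # Large-gradient instances keep weight 1; sampled small-gradient
    # instances are up-weighted by fact.
    weights = {i: 1.0 for i in top_set}
    weights.update({i: fact for i in rand_set})
    return top_set + rand_set, weights


grads = [0.9, -0.05, 0.02, -0.8, 0.1, 0.03, -0.01, 0.7, 0.04, -0.06]
used, w = goss_sample(grads, a=0.2, b=0.3)
print(used[:2])          # [0, 3]: the two largest-|gradient| instances
print(len(used))         # 5 of the 10 instances are used
print(sum(w.values()))   # close to 10, the size of the full data set
```

The total weight of the sample stays close to the original number of instances, which is what the factor (1-a)/b is for.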

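Mathematical analysis of the GOSS technique (calculation of the variance gain for splitting feature j)

Consider a training set with n instances $\{x_1, \dots, x_n\}$, where each $x_i$ is a vector of dimension s in the space $X^s$. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the model output are denoted $\{g_1, \dots, g_n\}$. In the GOSS method, the training instances are ranked in descending order of the absolute values of their gradients. The top a × 100% instances with the larger gradients are kept, giving an instance subset A. Then, from the remaining set $A^c$, which consists of the (1 - a) × 100% instances with smaller gradients, a subset B of size $b \times |A^c|$ is sampled at random. Finally, the instances are split according to the estimated variance gain $\tilde{V}_j(d)$ over the subset $A \cup B$.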

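Variance gain using the GOSS method:

$$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}\right)$$

where $A_l = \{x_i \in A : x_{ij} \le d\}$, $A_r = \{x_i \in A : x_{ij} > d\}$, $B_l = \{x_i \in B : x_{ij} \le d\}$, $B_r = \{x_i \in B : x_{ij} > d\}$, and $n_l^j(d)$ and $n_r^j(d)$ are the numbers of instances with $x_{ij} \le d$ and $x_{ij} > d$, respectively. The coefficient $(1-a)/b$ normalizes the sum of the gradients over B back to the size of $A^c$.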

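Exclusive Feature Bundling technique for LightGBM:
High-dimensional data is usually very sparse, which makes it possible to design a nearly lossless approach for reducing the number of features. Specifically, in a sparse feature space many features are mutually exclusive, i.e., they never take nonzero values simultaneously. Exclusive features can be safely bundled into a single feature (called an exclusive feature bundle). The complexity of histogram building therefore changes from O(#data × #feature) to O(#data × #bundle), with #bundle << #feature. As a result, training becomes faster without hurting accuracy.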

Algorithm for the Exclusive Feature Bundling technique:

Input: numData: number of data
Input: F: One bundle of exclusive features
binRanges ← {0}, totalBin ← 0
for f in F do
    totalBin += f.numBin
    binRanges.append(totalBin)
newBin ← new Bin(numData)
for i = 1 to numData do
    newBin[i] ← 0
    for j = 1 to len(F) do
        if F[j].bin[i] != 0 then
            newBin[i] ← F[j].bin[i] + binRanges[j]
Output: newBin, binRanges
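
The merge step above translates almost line by line into Python. The sketch below is illustrative only: it assumes each feature is already binned into integer indices with 0 as the default (zero) bin, and the function name `merge_exclusive_features` and its arguments are hypothetical, not part of the LightGBM API.

```python
def merge_exclusive_features(bundle_bins, num_bins):
    """Merge mutually exclusive binned features into one feature.

    bundle_bins: one list of bin indices per feature (0 = default bin)
    num_bins:    number of bins of each feature, in the same order
    """
    # Offsets: feature j is shifted by the total bin count before it.
    bin_ranges = [0]
    total_bin = 0
    for nb in num_bins:
        total_bin += nb
        bin_ranges.append(total_bin)

    num_data = len(bundle_bins[0])
    new_bin = [0] * num_data
    for i in range(num_data):
        for j, feature in enumerate(bundle_bins):
            if feature[i] != 0:
                new_bin[i] = feature[i] + bin_ranges[j]
    return new_bin, bin_ranges


f0 = [1, 0, 2, 0, 0]  # 3 bins (0..2)
f1 = [0, 3, 0, 0, 1]  # 4 bins (0..3), never nonzero together with f0
merged, ranges = merge_exclusive_features([f0, f1], num_bins=[3, 4])
print(merged)  # [1, 6, 2, 0, 4]
print(ranges)  # [0, 3, 7]
```

Because the offsets keep the bins of the two features in disjoint ranges, the original value of either feature can still be read back from the merged bin.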

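Architecture:
LightGBM grows trees leaf-wise, whereas other boosting algorithms grow trees level-wise. It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, the leaf-wise algorithm reaches a lower loss than the level-wise algorithm. Leaf-wise tree growth can increase the complexity of the model and may lead to overfitting on small datasets.
[Figure: diagrammatic representation of leaf-wise tree growth]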
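
To make the leaf-wise idea concrete, here is a toy sketch on one-dimensional data. It is an assumption-laden illustration, not LightGBM's tree learner: the gain is a plain variance-gain score, and the helper names `best_split` and `grow_leaf_wise` are invented for this example. At every step it splits the leaf whose best split has the largest gain, until `num_leaves` is reached or no leaf can be improved.

```python
import heapq


def best_split(points):
    """Return (gain, threshold) of the best split of sorted (x, y) points."""
    n = len(points)
    total = sum(y for _, y in points)
    best_gain, best_thr = 0.0, None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += points[k - 1][1]
        if points[k - 1][0] == points[k][0]:
            continue  # cannot split between equal feature values
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
        if gain > best_gain:
            best_gain = gain
            best_thr = (points[k - 1][0] + points[k][0]) / 2
    return best_gain, best_thr


def grow_leaf_wise(xs, ys, num_leaves):
    """Return the split thresholds in the order they were chosen."""
    heap, chosen, tie = [], [], 0

    def push(points):
        nonlocal tie
        gain, thr = best_split(points)
        # heapq is a min-heap, so store the negated gain.
        heapq.heappush(heap, (-gain, tie, thr, points))
        tie += 1

    push(sorted(zip(xs, ys)))
    while len(heap) < num_leaves:
        _, _, thr, points = heapq.heappop(heap)
        if thr is None:  # even the best leaf has no useful split
            break
        chosen.append(thr)
        push([p for p in points if p[0] <= thr])
        push([p for p in points if p[0] > thr])
    return chosen


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 5, 5, 9, 9]
print(grow_leaf_wise(xs, ys, num_leaves=3))  # [4.5, 6.5]
```

With three leaves, both splits land on the right-hand side of the data, where the targets still vary. A level-wise learner would have to treat the already pure left leaf as part of the same level.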

Code: Python implementation of a LightGBM model:
The dataset used for this example is Breast Cancer Prediction.

# Installing LightGBM (needed once; in a Jupyter Notebook run
# "!pip install lightgbm", otherwise run the command in a shell)
# pip install lightgbm

# Importing the required libraries
import pandas as pd

# Similarly, LGBMRegressor can be imported for a regression model.
from lightgbm import LGBMClassifier

# Reading the dataset
data = pd.read_csv("cancer_prediction.csv")

# Removing columns that are not required
data = data.drop(columns=['Unnamed: 32', 'id'])

# Skipping data exploration
# Encoding the diagnosis column (1 = benign, 0 = malignant).
# This assumes the column holds the labels 'B' and 'M'.
data['diagnosis'] = (data['diagnosis'] == 'B').astype(int)

# Splitting the dataset into two parts
train = data[0:400]
test = data[400:568]

# Separating the independent and target variables in both data sets
x_train = train.drop(columns=['diagnosis'])
y_train = train['diagnosis']
x_test = test.drop(columns=['diagnosis'])
y_test = test['diagnosis']

# Creating the model object and fitting it on the training data set
model = LGBMClassifier()
model.fit(x_train, y_train)

# Predicting the target variable
pred = model.predict(x_test)
print(pred)

# Mean accuracy on the test set
accuracy = model.score(x_test, y_test)
print(accuracy)

Output
Prediction array : 
[0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1
 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1
 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0]
Accuracy Score : 
0.9702380952380952

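Parameter tuning
Some important parameters and their usage are listed below:

max_depth: Sets a limit on the depth of the tree. The default value is -1, which means no limit. It is effective in controlling overfitting.
categorical_feature: Specifies the categorical features used for training the model.
bagging_fraction: Specifies the fraction of the data to be used in each iteration. It only takes effect when bagging_freq is set to a non-zero value.
num_iterations: Specifies the number of boosting iterations to perform. The default value is 100.
num_leaves: Specifies the number of leaves in a tree. It should be smaller than 2^max_depth.
max_bin: Specifies the maximum number of bins used to bucket the feature values.
min_data_in_bin: Specifies the minimum amount of data in one bin.
task: Specifies the task to perform, i.e., training or prediction. The default is train. Another possible value for this parameter is predict.
feature_fraction: Specifies the fraction of features to be considered in each iteration. The default value is 1.0.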
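
As a rough illustration of how these parameters fit together, the sketch below collects them in a dictionary of the kind that can be passed to `lightgbm.train`. The values are hypothetical starting points rather than recommendations, `objective` and `bagging_freq` are LightGBM parameters that the list above does not cover, and the helper `leaves_fit_depth` is invented here to check the num_leaves guideline.

```python
# Hypothetical starting values; tune them for the data at hand.
params = {
    "objective": "binary",      # binary classification
    "max_depth": 6,             # limit the tree depth
    "num_leaves": 31,           # should stay below 2 ** max_depth
    "max_bin": 255,             # bins used to bucket feature values
    "min_data_in_bin": 3,       # minimum data points per bin
    "feature_fraction": 0.8,    # use 80% of the features per iteration
    "bagging_fraction": 0.8,    # use 80% of the data per iteration
    "bagging_freq": 1,          # bagging needs a non-zero frequency
    "num_iterations": 100,      # number of boosting iterations
}


def leaves_fit_depth(p):
    """True if num_leaves stays below 2 ** max_depth (or depth is unlimited)."""
    return p["max_depth"] <= 0 or p["num_leaves"] < 2 ** p["max_depth"]


print(leaves_fit_depth(params))  # True: 31 < 2 ** 6 = 64
```

With training data wrapped in a `lightgbm.Dataset`, such a dictionary would typically be passed as the first argument to `lightgbm.train`.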