Ensemble Methods in Python
An ensemble refers to a group of items considered as a whole rather than individually. Ensemble methods create multiple models and combine them to solve a problem. Ensemble methods help improve the robustness/generalization of a model. In this article, we will discuss a few such methods and their implementation in Python. For this, we choose a dataset from the UCI repository.
Basic Ensemble Methods
1. Averaging: It is mainly used for regression problems. The method consists of independently building multiple models and returning the average of the predictions of all the models. In general, the combined output is better than an individual output because variance is reduced.
In the example below, three regression models (linear regression, xgboost, and random forest) are trained and their predictions are averaged. The final prediction output is pred_final.
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
train, target, test_size=0.20)
# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
# training all the model on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)
# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)
# final prediction after averaging on the prediction of all 3 models
pred_final = (pred_1+pred_2+pred_3)/3.0
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4560
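A common refinement of averaging is weighted averaging, where models that are expected to perform better receive larger weights. A minimal sketch, reusing pred_1, pred_2 and pred_3 from the code above (the weights here are illustrative assumptions, not tuned values):
Python3
# assumed weights for the three models; they should sum to 1
w1, w2, w3 = 0.2, 0.4, 0.4
# weighted average of the three model predictions
pred_weighted = w1 * pred_1 + w2 * pred_2 + w3 * pred_3
# printing the mean squared error of the weighted ensemble
print(mean_squared_error(y_test, pred_weighted))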
2. Max Voting: It is mainly used for classification problems. The method consists of independently building multiple models and getting their individual outputs, called "votes". The class with the maximum votes is returned as the output.
In the example below, three classification models (logistic regression, xgboost, and random forest) are combined using sklearn's VotingClassifier; that model is trained and the class with the maximum votes is returned as the output. The final prediction output is pred_final. Note that this is classification, not regression, so the loss may differ from that of the other ensemble methods.
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# importing voting classifier
from sklearn.ensemble import VotingClassifier
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["Weekday"]
# getting train data from the dataframe
train = df.drop("Weekday", axis=1)
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
train, target, test_size=0.20)
# initializing all the model objects with default parameters
model_1 = LogisticRegression()
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()
# Making the final model using voting classifier
final_model = VotingClassifier(
estimators=[('lr', model_1), ('xgb', model_2), ('rf', model_3)], voting='hard')
# training all the model on the train dataset
final_model.fit(X_train, y_train)
# predicting the output on the test dataset
pred_final = final_model.predict(X_test)
# printing log loss between actual and predicted value
# (log_loss expects probabilities; hard voting returns class labels,
# so this scoring is only meaningful for a binary 0/1 target)
print(log_loss(y_test, pred_final))
Output:
231
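For intuition, hard voting can also be computed by hand: fit each model separately, collect its predicted class for every sample, and return the most frequent class. A minimal sketch, reusing model_1, model_2 and model_3 defined above (ties are broken by taking the first mode, which may differ from VotingClassifier's tie-breaking):
Python3
# fitting each model individually so that it can cast its own vote
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)
# collecting the class predicted by each model for every sample
votes = pd.DataFrame({'lr': model_1.predict(X_test),
                      'xgb': model_2.predict(X_test),
                      'rf': model_3.predict(X_test)})
# the majority vote: the most frequent class in each row
pred_manual = votes.mode(axis=1)[0]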
Let us now look at more advanced ensemble methods.
Advanced Ensemble Methods
Ensemble methods are widely used in classical machine learning. Examples of algorithms that use bagging are random forest and the bagging meta-estimator; examples of algorithms that use boosting are GBM, XGBM, AdaBoost, etc.
As a developer of machine learning models, it is highly recommended to use ensemble methods. Ensemble methods are used extensively in almost all competitions and research papers.
1. Stacking: It is an ensemble method that combines multiple models (classification or regression) via a meta-model (meta-classifier or meta-regressor). The base models are trained on the complete dataset, then the meta-model is trained on features returned (as output) by the base models. The base models in stacking are typically different from each other. The meta-model helps to find the features from the base models that achieve the best accuracy.
Algorithm:
- Split the train dataset into n parts.
- A base model (say linear regression) is fitted on n-1 parts and predictions are made for the nth part. This is done for each of the n parts of the train set.
- The base model is then fitted on the whole train dataset.
- This model is used to predict the test dataset.
- Steps 2 to 4 are repeated for another base model, which results in another set of predictions for the train and test datasets.
- The predictions on the train dataset are used as features to build the new model.
- This final model is used to make the predictions on the test dataset.
Stacking is a bit different from the basic ensemble methods because it has first-level and second-level models. First, the stacking features are extracted by training all the first-level models on the train dataset. The second-level model is then trained on the train stacking features, and this model predicts the final output from the test stacking features.
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
# importing stacking lib
from vecstack import stacking
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
train, target, test_size=0.20)
# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
# putting all base model objects in one list
all_models = [model_1, model_2, model_3]
# computing the stack features
s_train, s_test = stacking(all_models, X_train, y_train,
                           X_test, regression=True, n_folds=4)
# initializing the second-level model
final_model = LinearRegression()
# fitting the second level model with stack features
final_model = final_model.fit(s_train, y_train)
# predicting the final output using stacking
pred_final = final_model.predict(X_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4510
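scikit-learn also ships its own stacking implementation, StackingRegressor, which computes the out-of-fold predictions internally, so the vecstack library is not strictly required. A minimal equivalent sketch with the same three base models (cv=4 mirrors the n_folds=4 used above):
Python3
from sklearn.ensemble import StackingRegressor
# the same base models, combined through scikit-learn's stacking API
stack_model = StackingRegressor(
    estimators=[('lr', LinearRegression()),
                ('xgb', xgb.XGBRegressor()),
                ('rf', RandomForestRegressor())],
    # second-level (meta) model trained on the base models' predictions
    final_estimator=LinearRegression(),
    cv=4)
# fitting trains the base models fold-wise, then the meta model
stack_model.fit(X_train, y_train)
print(mean_squared_error(y_test, stack_model.predict(X_test)))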
2. Blending: It is similar to the stacking method explained above, but rather than using the whole dataset for training the base models, a separate validation dataset is held out to make predictions on.
Algorithm:
- Split the training dataset into train, test, and validation datasets.
- Fit all the base models using the train dataset.
- Make predictions on the validation and test datasets.
- These predictions are used as features to build a second-level model.
- This model is used to make predictions on the test dataset using the meta-features.
Python3
# importing utility modules
import pandas as pd
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.linear_model import LinearRegression
# importing train test split
from sklearn.model_selection import train_test_split
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# performing the train test and validation split
train_ratio = 0.70
validation_ratio = 0.20
test_ratio = 0.10
# performing train test split
x_train, x_test, y_train, y_test = train_test_split(
train, target, test_size=1 - train_ratio)
# performing test validation split
x_val, x_test, y_val, y_test = train_test_split(
x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
# initializing all the base model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()
# training all the model on the train dataset
# training first model
model_1.fit(x_train, y_train)
val_pred_1 = model_1.predict(x_val)
test_pred_1 = model_1.predict(x_test)
# converting to dataframe (keeping the original row index so that
# pd.concat aligns the predictions with the right rows later on)
val_pred_1 = pd.DataFrame(val_pred_1, index=x_val.index)
test_pred_1 = pd.DataFrame(test_pred_1, index=x_test.index)
# training second model
model_2.fit(x_train, y_train)
val_pred_2 = model_2.predict(x_val)
test_pred_2 = model_2.predict(x_test)
# converting to dataframe
val_pred_2 = pd.DataFrame(val_pred_2, index=x_val.index)
test_pred_2 = pd.DataFrame(test_pred_2, index=x_test.index)
# training third model
model_3.fit(x_train, y_train)
val_pred_3 = model_3.predict(x_val)
test_pred_3 = model_3.predict(x_test)
# converting to dataframe
val_pred_3 = pd.DataFrame(val_pred_3, index=x_val.index)
test_pred_3 = pd.DataFrame(test_pred_3, index=x_test.index)
# concatenating validation dataset along with all the predicted validation data (meta features)
df_val = pd.concat([x_val, val_pred_1, val_pred_2, val_pred_3], axis=1)
df_test = pd.concat([x_test, test_pred_1, test_pred_2, test_pred_3], axis=1)
# making the final model using the meta features
final_model = LinearRegression()
final_model.fit(df_val, y_val)
# getting the final output
final_pred = final_model.predict(df_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, final_pred))
Output:
4790
3. Bagging: It is also known as the bootstrapping method. The base models are run on bags to get a fair distribution of the whole dataset. A bag is a subset of the dataset sampled with replacement, which makes the size of the bag the same as that of the whole dataset. The final output is formed after combining the outputs of all the base models.
Algorithm:
- Create multiple datasets from the train dataset by selecting observations with replacement.
- Run a base model on each of the created datasets independently.
- Combine the predictions of all the base models to get the final output.
Bagging normally uses only one base model (an XGBoost regressor in the code below).
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
import xgboost as xgb
# importing bagging module
from sklearn.ensemble import BaggingRegressor
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
train, target, test_size=0.20)
# initializing the bagging model using XGBoost as the base model with default parameters
# (in scikit-learn versions before 1.2, `estimator` was named `base_estimator`)
model = BaggingRegressor(estimator=xgb.XGBRegressor())
# training model
model.fit(X_train, y_train)
# predicting the output on the test dataset
pred = model.predict(X_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred))
Output:
4666
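The bootstrap sampling that BaggingRegressor performs internally can be illustrated by hand: each bag is drawn from the training data with replacement and has as many rows as the training data itself. A minimal sketch (n_bags=10 mirrors BaggingRegressor's default n_estimators):
Python3
import numpy as np
n_bags = 10
models = []
for i in range(n_bags):
    # sampling row indices with replacement: the bag has the same size as X_train
    idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
    bag_model = xgb.XGBRegressor()
    bag_model.fit(X_train.iloc[idx], y_train.iloc[idx])
    models.append(bag_model)
# combining the base models by averaging their predictions
pred_manual = np.mean([m.predict(X_test) for m in models], axis=0)
print(mean_squared_error(y_test, pred_manual))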
4. Boosting: Boosting is a sequential method: it aims to prevent a wrong base model from affecting the final output. Instead of combining base models, the method focuses on building a new model that depends on the previous one. Each new model tries to eliminate the errors made by its predecessor. Each of these models is called a weak learner. The final model (aka a strong learner) is formed by taking the weighted mean of all the weak learners.
Algorithm:
- Take a subset of the train dataset.
- Train a base model on that dataset.
- Use this model to make predictions on the whole dataset.
- Calculate errors using the predicted values and actual values.
- Initialize all data points with the same weight.
- Assign higher weights to incorrectly predicted data points.
- Make another model and make predictions using the new model, in such a way that errors made by the previous model are mitigated/corrected.
- Similarly, create multiple models, each successive model correcting the errors of the previous model.
- The final model (strong learner) is the weighted mean of all the previous models (weak learners).
Python3
# importing utility modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# importing machine learning models for prediction
from sklearn.ensemble import GradientBoostingRegressor
# loading train data set in dataframe from train_data.csv file
df = pd.read_csv("train_data.csv")
# getting target data from the dataframe
target = df["target"]
# getting train data from the dataframe
train = df.drop("target", axis=1)
# Splitting between train data into training and validation dataset
X_train, X_test, y_train, y_test = train_test_split(
train, target, test_size=0.20)
# initializing the boosting module with default parameters
model = GradientBoostingRegressor()
# training the model on the train dataset
model.fit(X_train, y_train)
# predicting the output on the test dataset
pred_final = model.predict(X_test)
# printing the mean squared error between real value and predicted value
print(mean_squared_error(y_test, pred_final))
Output:
4789
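The sequential error-correction that boosting describes can be sketched by hand for regression: each new weak learner is fitted to the residuals (errors) of the ensemble built so far, and its prediction is added with a shrinkage factor. A minimal sketch of this idea (learning_rate=0.1, 100 rounds, and max_depth=3 mirror GradientBoostingRegressor's defaults):
Python3
import numpy as np
from sklearn.tree import DecisionTreeRegressor
learning_rate, n_rounds = 0.1, 100
# starting from the mean of the target as the initial prediction
train_pred = np.full(len(y_train), y_train.mean())
test_pred = np.full(len(y_test), y_train.mean())
for i in range(n_rounds):
    # each weak learner fits the errors left by the ensemble so far
    residuals = y_train - train_pred
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X_train, residuals)
    # adding the new learner's contribution with shrinkage
    train_pred += learning_rate * tree.predict(X_train)
    test_pred += learning_rate * tree.predict(X_test)
print(mean_squared_error(y_test, test_pred))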
Note: scikit-learn provides several modules/methods for ensemble learning. Note that the accuracy of one method does not mean it is superior to another. This article aims to give a brief introduction to ensemble methods, not to compare them with each other. The programmer must use the methods that suit the data.