DaskGridSearchCV – A Competitor to GridSearchCV

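Data science buzzwords such as machine learning, artificial intelligence, and deep learning are everywhere on the internet these days. Everyone wants to try different machine learning and deep learning models and get the best possible results. Some models are computationally expensive. One way to arrive at the best model is hyperparameter tuning.
Hyperparameter tuning means finding the best set of parameter values for a chosen model. Two common approaches are GridSearchCV and RandomizedSearchCV.
GridSearchCV considers every combination of the candidate values when looking for the best parameters. With many parameters and values to tune, this takes a long time, and tuning often accounts for most of the time spent in a machine learning workflow. There is a way to speed it up. Before diving into the method, let's go through the basics of GridSearchCV and parallel computing.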
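
To make the difference between the two approaches concrete, here is a minimal sketch that only counts candidates and does not train anything. The search space is a hypothetical one chosen for illustration; `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that generate candidates for grid and randomized search respectively.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space, chosen only for illustration.
space = {"kernel": ["linear", "rbf"], "C": [1, 2, 3, 6]}

# Grid search enumerates every combination: 2 kernels x 4 values of C.
grid_candidates = list(ParameterGrid(space))

# Randomized search draws a fixed number of combinations instead.
random_candidates = list(ParameterSampler(space, n_iter=3, random_state=0))

print(len(grid_candidates))    # 8
print(len(random_candidates))  # 3
```

The grid grows multiplicatively with every parameter added, while the randomized budget stays at whatever `n_iter` is set to.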

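What is grid search?

GridSearchCV is a technique for finding the best parameter values from a given grid of candidates. It evaluates every combination with cross-validation. It takes a model and a parameter grid as input, extracts the best parameter values, and then uses them to make predictions.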
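
The idea can be sketched by hand as a loop over candidates, each scored with cross-validation. This is only an illustration of the concept, not scikit-learn's actual implementation; the bundled iris dataset and the small grid are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

# Assumption: the iris data bundled with scikit-learn stands in for real data.
X, y = load_iris(return_X_y=True)
grid = {"kernel": ["linear", "rbf"], "C": [1, 2, 3, 6]}

best_score, best_params = -1.0, None
for params in ParameterGrid(grid):
    # Score this candidate with 5-fold cross-validation.
    scores = cross_val_score(SVC(**params), X, y, cv=5)
    mean_score = scores.mean()
    # Keep the candidate with the highest mean validation score.
    if mean_score > best_score:
        best_score, best_params = mean_score, params

print(best_params, round(best_score, 3))
```

GridSearchCV wraps this loop, runs the fits in parallel when asked to, and refits the winning model on the full data.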

Code: Python code showing how GridSearchCV works:

python3

# Importing the libraries needed
# (install them first from a shell: pip install pandas scikit-learn)
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Loading the Dataset
# A standard dataset here is taken for better understanding.
iris = pd.read_csv('https://raw.githubusercontent.com/pranavkotak8/Datasets/master/Iris.csv')
target = iris['Species']
iris.drop(columns=['Id', 'Species'], inplace=True)

# Assigning the parameters and its values which need to be tuned.
parameters = {'kernel': ['linear', 'rbf'], 'C': [1, 2, 3, 6]}

# Creating the SVM model
modelsvc = SVC()

# Performing the GridSearchCV
clf = GridSearchCV(modelsvc, parameters)
clf.fit(iris, target)

# Best parameter combination found by the search
print(clf.best_params_)

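The code above shows how GridSearchCV is used. It uses an SVM model, but other models work the same way; only the parameters and their values change. Here I tuned just 2 parameters, so the search is fast. But what if we have more parameters or a more complex model to fit? Let's answer that question directly.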
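
To see how quickly the workload grows, this small sketch counts the model fits a full grid search performs. It compares the two-parameter grid above with the larger SVM grid used in the timing comparison later in this article. The helper `count_fits` is a hypothetical name introduced for illustration, and 5 folds is assumed for both grids.

```python
from sklearn.model_selection import ParameterGrid

small_grid = {"kernel": ["linear", "rbf"], "C": [1, 2, 3, 6]}
large_grid = {
    "C": [0.1, 1, 5, 10, 15, 20, 100, 500],
    "gamma": [0.5, 0.80, 1, 0.1],
    "kernel": ["rbf", "linear", "sigmoid"],
}
cv_folds = 5  # assumption: 5-fold cross-validation for both searches


def count_fits(grid, folds):
    """Hypothetical helper: fits in a full grid search, excluding the final refit."""
    return len(ParameterGrid(grid)) * folds


print(count_fits(small_grid, cv_folds))  # 40
print(count_fits(large_grid, cv_folds))  # 480
```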

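Can we speed up GridSearchCV in any way?

The answer is yes, we can make GridSearchCV faster. To see how, let's first look at how GridSearchCV actually works.

How GridSearchCV works:

GridSearchCV is a class in scikit-learn, a machine learning library for Python. It performs an exhaustive search over the specified parameter values for an estimator. The estimator must provide a score function; otherwise a scoring argument must be passed. The two methods used most often on GridSearchCV are fit and predict. Others, such as predict_proba and decision_function, are available when the underlying estimator supports them. Each algorithm has its own parameters, depending on the kind of analysis being run on the dataset at hand. The user supplies a set of candidate values for the important parameters, and GridSearchCV uses cross-validation to find the best value for each of them. Parameters left out of the grid keep their default values.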
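
The point about scoring can be illustrated with a short sketch. Without a `scoring` argument the search falls back to the estimator's own `score` method (mean accuracy for `SVC`); with one, that metric ranks the candidates instead. The iris data and the choice of `"f1_macro"` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: bundled iris data and a small grid, for illustration only.
X, y = load_iris(return_X_y=True)
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 2, 3, 6]}

# No scoring argument: candidates are ranked by SVC.score, i.e. mean accuracy.
default_search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# Explicit scoring: candidates are ranked by macro-averaged F1 instead.
f1_search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro").fit(X, y)

print(default_search.best_params_, default_search.best_score_)
print(f1_search.best_params_, f1_search.best_score_)
```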

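The intuition behind GridSearchCV:

Every data scientist who studies a model wants the best version of it for the final, conclusive analysis, and GridSearchCV helps build it. The program is told to run a grid search with cross-validation. The cross-validation used in GridSearchCV is k-fold cross-validation (for classifiers, scikit-learn stratifies the folds by default). In k-fold cross-validation the data is split into k folds, with k chosen by the analyst, and each fold is used for testing exactly once. For example, if k = 3, the first iteration tests the model on the first fold and trains it on the remaining folds. The second iteration tests on the second fold and trains on the first and third. This repeats until every fold has been used for testing. Evaluated this way, grid search covers every parameter combination and finds the best model for the algorithm used in the problem at hand.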
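
The k = 3 example can be reproduced with scikit-learn's `KFold` splitter. This is a minimal sketch on six toy samples (an assumption chosen so the folds are easy to read), not the exact splitter GridSearchCV picks for every estimator.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(6, 1)  # six toy samples, indexed 0..5
kfold = KFold(n_splits=3)       # no shuffling, so folds are contiguous

splits = list(kfold.split(X))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"iteration {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
# iteration 1: train=[2, 3, 4, 5] test=[0, 1]
# iteration 2: train=[0, 1, 4, 5] test=[2, 3]
# iteration 3: train=[0, 1, 2, 3] test=[4, 5]
```

Each sample lands in the test set exactly once, which is what lets every parameter combination be scored on all of the data.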
The different methods and their uses are listed below.

Methods:

The most commonly used ones are:

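fit() – takes the input data and fits the estimator for every combination of hyperparameter values.
predict(X) – predicts on the given data X using the best parameters found by fit.
score() – returns the score of the best model on the given data.
get_params() – returns the settings of the GridSearchCV object itself; the best parameters and their values are stored in the best_params_ attribute after fitting.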
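
The methods listed above can be tried together in a short sketch. The iris data, the 80/20 split, and the small grid are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumption: bundled iris data with a stratified 80/20 split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 2, 3, 6]}, cv=5)

# fit: runs the cross-validated search, then refits the best model.
search.fit(X_train, y_train)

# predict and score: both delegate to the refitted best estimator.
predictions = search.predict(X_test)
test_accuracy = search.score(X_test, y_test)

print(search.best_params_)                    # the winning combination
print(sorted(search.get_params(deep=False)))  # settings of the search object itself
print(predictions[:5], test_accuracy)
```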

You can download the data from the link.

Code:

python3

# Importing the libraries which are required
import time
import warnings

import pandas as pd
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from dask_ml.model_selection import GridSearchCV as DaskGridSearchCV

warnings.filterwarnings('ignore')

# Reading the train data
train = pd.read_csv('C:\\Users\\prana\\Downloads\\smartphone_activity_dataset.csv')

# Separating the target column from the features
target = train['activity']
train = train.drop(columns=['activity'])

# Scaling the data
scaler = MinMaxScaler()
train_f = pd.DataFrame(scaler.fit_transform(train))

# Splitting into train and test set
# (test_size = 0.8 keeps only 20% of the rows for the search)
X_train, X_test, y_train, y_test = train_test_split(
    train_f, target, test_size=0.8, random_state=100)

# The same parameter grid is used for both searches
parameters = {
    'C': [0.1, 1, 5, 10, 15, 20, 100, 500],
    'gamma': [0.5, 0.80, 1, 0.1],
    'kernel': ['rbf', 'linear', 'sigmoid']}

# Running the search with DaskGridSearchCV and timing it
start = time.time()
gscv = DaskGridSearchCV(svm.SVC(), param_grid=parameters, cv=5, n_jobs=-1)
grid_results = gscv.fit(X_train, y_train)
end = time.time()
print("Time Taken with Dask GridSearchCV:", end - start)

# Running the normal GridSearchCV with the same algorithm, the same
# parameters and the same set of values. This is done only to compare
# the computational time of the two methods.
start = time.time()
gscv = GridSearchCV(svm.SVC(), parameters, cv=5,
                    return_train_score=False, n_jobs=-1)
grid_results = gscv.fit(X_train, y_train)
end = time.time()
print("Time Taken without Dask GridSearchCV:", end - start)

Output:

Comparison of the scikit-learn and Dask versions of GridSearchCV:

Version	Time taken (seconds)
Scikit-learn GridSearchCV	424.300
Dask GridSearchCV	388.103

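Conclusion:

The output shows that DaskGridSearchCV was about 1.09 times faster than the regular GridSearchCV in this run, so the time spent searching for the best parameter values went down. The same approach can be applied to other algorithms and to larger parameter grids.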
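
The 1.09 figure follows directly from the two timings reported above. The short calculation below also expresses the same result as a percentage of time saved; the variable names are introduced here for illustration.

```python
# Timings reported in the comparison above, in seconds.
sklearn_seconds = 424.300
dask_seconds = 388.103

speedup = sklearn_seconds / dask_seconds
time_saved_pct = (sklearn_seconds - dask_seconds) / sklearn_seconds * 100

print(f"speedup: {speedup:.2f}x")            # speedup: 1.09x
print(f"time saved: {time_saved_pct:.1f}%")  # time saved: 8.5%
```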
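Here are some key points to consider when applying Dask-SearchCV:

If the model is a pipeline and its early steps are expensive, you inherit a performance benefit.
If the data you are fitting already lives on a cluster, Dask-SearchCV performs better because it works well with remote data.
If your data is very large, this will not help. It is meant for scheduling scikit-learn estimators that are fitted on small to medium-sized data.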
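
The pipeline case from the first point might look like the sketch below. It is an illustration under stated assumptions rather than a benchmark: the iris data, the PCA step, and the step names are hypothetical choices. The import falls back to scikit-learn's GridSearchCV when dask-ml is not installed, which works only because the two classes share the same basic interface.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

try:
    # Dask-ML's drop-in replacement for the scikit-learn class.
    from dask_ml.model_selection import GridSearchCV
except ImportError:
    # Fallback so the sketch still runs where dask-ml is not installed.
    from sklearn.model_selection import GridSearchCV

# Assumption: bundled iris data stands in for a real dataset.
X, y = load_iris(return_X_y=True)

# Only the final step is tuned, so the scaling and PCA steps are
# identical for every candidate within a cross-validation split.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("reduce", PCA(n_components=2)),
    ("svc", SVC()),
])
param_grid = {"svc__C": [1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the early steps do not change between candidates, a scheduler that recognizes the shared work can fit them once per split instead of once per candidate, which is where the inherited benefit would come from.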

Reference: sklearn.model_selection.GridSearchCV.html