The Sklearn.StratifiedShuffleSplit() Function in Python
In this article, we will look at the StratifiedShuffleSplit cross-validator from the sklearn library, which provides train/test indices for splitting data into training and test sets.
What is StratifiedShuffleSplit?
StratifiedShuffleSplit is a combination of ShuffleSplit and StratifiedKFold. With StratifiedShuffleSplit, the proportion of each class label is nearly the same in the training and test sets. The key difference between StratifiedShuffleSplit and StratifiedKFold (with shuffle=True) is that in StratifiedKFold the dataset is shuffled only once at the start and then split into the specified number of folds, which rules out any chance of the test folds overlapping with one another.
In StratifiedShuffleSplit, however, the data is reshuffled before every split, which is why the test sets of different splits may overlap.
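To see the stratification in action, here is a minimal, self-contained sketch on a synthetic label array (the 8:4 class counts and dummy features are invented purely for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy labels: 8 samples of class 0 and 4 of class 1 (a 2:1 ratio)
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((12, 1))  # feature values are irrelevant to stratification

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # every test set keeps the 2:1 ratio: 4 samples of class 0, 2 of class 1
    print(np.bincount(y[test_idx]))  # prints [4 2] for each split
```

Because test_size=0.5 divides both class counts exactly, each test set contains precisely half of each class, preserving the 2:1 ratio.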
Syntax: sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)
Parameters:
n_splits: int, default=10
Number of re-shuffling & splitting iterations.
test_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
train_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
random_state: int, RandomState instance or None, default=None
Controls the randomness of the training and testing indices produced.
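The overlap behaviour described above can be checked directly. The sketch below (synthetic 6/6 labels, invented for illustration) contrasts StratifiedKFold, whose test folds partition the data, with StratifiedShuffleSplit, whose test sets are drawn independently and so may share indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

y = np.array([0] * 6 + [1] * 6)
X = np.zeros((12, 1))

# StratifiedKFold: the test folds partition the data -- no index appears twice
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
kfold_tests = [set(test_idx) for _, test_idx in skf.split(X, y)]

# StratifiedShuffleSplit: each split reshuffles, so test sets can overlap
sss = StratifiedShuffleSplit(n_splits=3, test_size=4, random_state=0)
sss_tests = [set(test_idx) for _, test_idx in sss.split(X, y)]

print(kfold_tests[0] & kfold_tests[1])  # always the empty set
print(sss_tests[0] & sss_tests[1])      # may be non-empty
```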
Below is the implementation.
Step 1: Import the required modules.
Python3
# import the libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
Step 2: Load the dataset and identify the dependent and independent variables.
The dataset can be downloaded from here.
Python3
# convert the dataset into a dataframe
churn_df = pd.read_csv(r"ChurnData.csv")

# assign dependent and independent variables
X = churn_df[['tenure', 'age', 'address', 'income',
              'ed', 'employ', 'equip', 'callcard', 'wireless']]
y = churn_df['churn'].astype('int')
Step 3: Pre-process the data.
Python3
# data pre-processing: standardize the features
X = preprocessing.StandardScaler().fit(X).transform(X)
Step 4: Create an object of the StratifiedShuffleSplit class.
Python3
# use StratifiedShuffleSplit()
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.5,
                             random_state=0)
sss.get_n_splits(X, y)
Output:

4
Step 5: Call the instance and split the dataframe into training and test samples. The split() function returns the indices of the training and test samples. Fit a random forest classifier on each split and compare the accuracy of its predictions.
Python3
scores = []

# fit a random forest classifier on each split
rf = RandomForestClassifier(n_estimators=40, max_depth=7)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    scores.append(accuracy_score(y_test, pred))

# get the accuracy of each prediction
print(scores)
Output: a list of four accuracy scores, one per split. The exact values vary from run to run, since the classifier is not seeded.
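As a variation, the loop above can be replaced by cross_val_score, which accepts the splitter directly as its cv argument. This is a sketch on synthetic data (a random stand-in for the churn dataframe, since ChurnData.csv may not be at hand):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# synthetic stand-in for the churn features and labels used above
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)

sss = StratifiedShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
rf = RandomForestClassifier(n_estimators=40, max_depth=7, random_state=0)

# one accuracy score per split, equivalent to the manual loop
scores = cross_val_score(rf, X, y, cv=sss, scoring='accuracy')
print(scores)
```

Passing the splitter as cv keeps the stratified reshuffling behaviour while removing the explicit indexing boilerplate.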