The Sklearn.StratifiedShuffleSplit() Function in Python
In this article, we will look at the StratifiedShuffleSplit cross-validator from the sklearn library, which provides train/test indices for splitting data into training and test sets.
What is StratifiedShuffleSplit?
StratifiedShuffleSplit is a combination of ShuffleSplit and StratifiedKFold. With StratifiedShuffleSplit, the proportion of each class label is nearly the same in the training and test sets. The key difference between StratifiedShuffleSplit and StratifiedKFold (with shuffle=True) is that in StratifiedKFold the dataset is shuffled only once at the start and then split into the specified number of folds, which rules out any chance of the test folds overlapping with one another.
In StratifiedShuffleSplit, however, the data is reshuffled before every split, which is why the test sets of different splits may overlap.
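To see the stratification in action, here is a minimal, self-contained sketch on a synthetic label array (the 8:4 class counts and dummy features are invented purely for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy labels: 8 samples of class 0 and 4 of class 1 (a 2:1 ratio)
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((12, 1))  # feature values are irrelevant to stratification

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # every test set keeps the 2:1 ratio: 4 samples of class 0, 2 of class 1
    print(np.bincount(y[test_idx]))  # prints [4 2] for each split
```

Because test_size=0.5 divides both class counts exactly, each test set contains precisely half of each class, preserving the 2:1 ratio.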
Syntax: sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)
Parameters:
n_splits: int, default=10
Number of re-shuffling & splitting iterations.
test_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
train_size: float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples.
random_state: int, RandomState instance or None, default=None
Controls the randomness of the training and testing indices produced.
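The overlap behaviour described above can be checked directly. The sketch below (synthetic 6/6 labels, invented for illustration) contrasts StratifiedKFold, whose test folds partition the data, with StratifiedShuffleSplit, whose test sets are drawn independently and so may share indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

y = np.array([0] * 6 + [1] * 6)
X = np.zeros((12, 1))

# StratifiedKFold: the test folds partition the data -- no index appears twice
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
kfold_tests = [set(test_idx) for _, test_idx in skf.split(X, y)]

# StratifiedShuffleSplit: each split reshuffles, so test sets can overlap
sss = StratifiedShuffleSplit(n_splits=3, test_size=4, random_state=0)
sss_tests = [set(test_idx) for _, test_idx in sss.split(X, y)]

print(kfold_tests[0] & kfold_tests[1])  # always the empty set
print(sss_tests[0] & sss_tests[1])      # may be non-empty
```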
Below is the implementation.
Step 1: Import the required modules.
Python3
# import the libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
Step 2: Load the dataset and identify the dependent and independent variables.
The dataset can be downloaded from here.
Python3
# convert the dataset into a dataframe
churn_df = pd.read_csv(r"ChurnData.csv")

# assign dependent and independent variables
X = churn_df[['tenure', 'age', 'address', 'income',
              'ed', 'employ', 'equip', 'callcard', 'wireless']]
y = churn_df['churn'].astype('int')
Step 3: Pre-process the data.
Python3
# data pre-processing: standardize the features
X = preprocessing.StandardScaler().fit(X).transform(X)
Step 4: Create an object of the StratifiedShuffleSplit class.
Python3
# use StratifiedShuffleSplit()
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.5,
                             random_state=0)
sss.get_n_splits(X, y)
Output:

4
Step 5: Call the instance and split the dataframe into training and test samples. The split() function returns the indices of the training and test samples. Fit a random forest classifier on each split and compare the accuracy of its predictions.
Python3
scores = []

# fit a random forest classifier on each split
rf = RandomForestClassifier(n_estimators=40, max_depth=7)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    scores.append(accuracy_score(y_test, pred))

# get the accuracy of each prediction
print(scores)
Output: a list of four accuracy scores, one per split. The exact values vary from run to run, since the classifier is not seeded.
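As a variation, the loop above can be replaced by cross_val_score, which accepts the splitter directly as its cv argument. This is a sketch on synthetic data (a random stand-in for the churn dataframe, since ChurnData.csv may not be at hand):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# synthetic stand-in for the churn features and labels used above
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)

sss = StratifiedShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
rf = RandomForestClassifier(n_estimators=40, max_depth=7, random_state=0)

# one accuracy score per split, equivalent to the manual loop
scores = cross_val_score(rf, X, y, cv=sss, scoring='accuracy')
print(scores)
```

Passing the splitter as cv keeps the stratified reshuffling behaviour while removing the explicit indexing boilerplate.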