Pandas DataFrame.sample()

📅 Last modified: 2023-12-03 14:45:02.459000 🧑 Author: Mango

DataFrame.sample() in Pandas

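In data science and machine learning you often need a random sample from a dataset, and DataFrame.sample() makes this convenient. pandas.DataFrame.sample() returns a random selection of rows (or columns, with axis=1) from a DataFrame: either a fixed number n or a fraction frac of the total. By default every row is equally likely to be chosen; the weights parameter lets you supply your own selection probabilities.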

Syntax

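DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

Parameters

n: Number of rows to return. Cannot be combined with frac. Defaults to 1 when frac is also None.
frac: Fraction of rows to return, usually between 0 and 1. Cannot be combined with n. A value greater than 1 requires replace=True.
replace: Boolean, default False. If True, sampling is done with replacement, so the same row can be selected more than once.
weights: Array-like (or, when sampling rows, a column name) giving the selection weight of each row. The default None gives every row equal probability. Weights that do not sum to 1 are normalized.
random_state: An integer seed or a NumPy RandomState. If None, NumPy's global random number generator is used; pass a value to get reproducible results.
axis: Whether to sample rows or columns. 0 (or 'index') samples rows, 1 (or 'columns') samples columns. The default None samples rows for a DataFrame.
ignore_index: Boolean, default False (available in pandas 1.3 and later). If True, the result is relabeled 0, 1, ..., n - 1.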
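As a minimal sketch of two of the parameters above, the snippet below fixes random_state to make a sample reproducible and uses axis=1 to sample a column instead of rows. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'score': [90, 80, 70, 60, 50],
})

# The same seed yields the same rows, so the two samples are identical.
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
print(first.equals(second))  # True

# axis=1 samples columns instead of rows: one of 'name' or 'score'.
one_column = df.sample(n=1, axis=1, random_state=42)
print(one_column.columns.tolist())
```

Fixing the seed is mainly useful in tests and tutorials, where the sampled rows need to be the same on every run.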

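Return value

Returns a new DataFrame containing the randomly sampled rows: n rows, or the fraction frac of all rows. The sampled rows keep their original index labels.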
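The sketch below checks what comes back from sample(): a DataFrame whose rows keep their original index labels, or a Series when the method is called on a Series. It assumes pandas 1.3 or later for the ignore_index argument; the seed 0 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'score': [90, 80, 70, 60, 50],
})

# The result is a new DataFrame; its index holds the original row labels.
sample = df.sample(n=3, random_state=0)
print(type(sample).__name__)  # DataFrame
print(sample.index.tolist())

# ignore_index=True (pandas >= 1.3) relabels the result 0, 1, 2.
renumbered = df.sample(n=3, random_state=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# Calling sample() on a Series returns a Series.
scores = df['score'].sample(n=2, random_state=0)
print(type(scores).__name__)  # Series
```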

Examples

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'score': [90, 80, 70, 60, 50]
}

df = pd.DataFrame(data)

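Now we use DataFrame.sample() to draw n random rows from the DataFrame. No random_state is set in these examples, so the rows you get will differ from run to run.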

df_sample = df.sample(n=3)
print(df_sample)

Output:

      name  score
4   Edward     50
0    Alice     90
2  Charlie     70

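We can also specify the sample size as a fraction of the total number of rows. In this example we randomly select 50% of the rows, which for this five-row DataFrame yields 2 rows.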

df_sample_frac = df.sample(frac=0.5)
print(df_sample_frac)

Output:

     name  score
4  Edward     50
1     Bob     80
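A common use of frac is shuffling: frac=1 returns every row exactly once, in random order. The sketch below shows this idiom; the seed 7 and the reset_index(drop=True) call, which renumbers the shuffled rows, are illustrative choices rather than part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'score': [90, 80, 70, 60, 50],
})

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle.
shuffled = df.sample(frac=1, random_state=7)

# Drop the old labels so the shuffled rows are numbered 0..4 again.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```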

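To sample with replacement, so that the same row can be drawn more than once, set the replace parameter to True. Here n=6 is larger than the five rows available, which is only possible when sampling with replacement.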

df_sample_replace = df.sample(n=6, replace=True)
print(df_sample_replace)

Output:

    name  score
0  Alice     90
3  David     60
3  David     60
1    Bob     80
0  Alice     90
1    Bob     80
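To illustrate why replace=True is needed here, the following sketch requests more rows than the DataFrame contains. Without replacement pandas raises a ValueError; with replacement the call succeeds and must contain at least one repeated row. The variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'score': [90, 80, 70, 60, 50],
})

# Asking for 6 rows out of 5 without replacement is an error.
error = None
try:
    df.sample(n=6)
except ValueError as exc:
    error = exc
    print(f'sampling failed: {exc}')

# With replacement the same request works; some row appears at least twice.
oversampled = df.sample(n=6, replace=True, random_state=1)
print(len(oversampled))  # 6
print(oversampled.index.has_duplicates)  # True
```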

We can also use the weights parameter to set the probability of selecting each row.

weights = [0.1, 0.2, 0.3, 0.3, 0.1]
df_sample_weights = df.sample(n=3, weights=weights)
print(df_sample_weights)

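Output:

      name  score
2  Charlie     70
3    David     60
1      Bob     80

Each weight is the relative probability of drawing that row. Bob (0.2), Charlie (0.3), and David (0.3) together account for 0.2 + 0.3 + 0.3 = 0.8 of the total weight, so they are much more likely to appear in the sample than Alice or Edward (0.1 each). Because replace defaults to False, no row can appear twice in the result.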
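The effect of weights is easiest to see with extreme values. In the sketch below, rows with weight 0 can never be drawn, and the weights do not sum to 1 because pandas normalizes them. The second call passes a column name as weights, which pandas accepts when sampling rows. The specific weights and the seed 3 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
    'score': [90, 80, 70, 60, 50],
})

# Relative weights: only Charlie and David can be selected.
weights = [0, 0, 1, 3, 0]
picked = df.sample(n=2, weights=weights, random_state=3)
# Two rows drawn without replacement from two eligible rows: both appear.
print(sorted(picked['name']))  # ['Charlie', 'David']

# A column name can serve as the weights: higher scores are drawn more often.
by_score = df.sample(n=2, weights='score', random_state=3)
print(by_score)
```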