Simple Random Sampling and Stratified Sampling in PySpark
In this article, we will discuss simple random sampling and stratified sampling in PySpark.
Simple random sampling:
In simple random sampling, elements are not obtained in any particular order; they are drawn at random, which is why every element is equally likely to be selected. In short, random sampling is the process of selecting a random subset of a large dataset. Simple random sampling in PySpark is performed with the sample() function, and it comes in two types: with replacement and without replacement. Both are discussed in detail below.
Method 1: Random sampling with replacement
Random sampling with replacement is a type of random sampling in which a previously selected element is returned to the population, so it may be drawn again in a later draw.
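The idea can be illustrated with Python's standard library before turning to PySpark (this is a plain-Python sketch of the concept, not the PySpark API; the population list is made up for illustration):

```python
import random

random.seed(42)  # fix the seed so the draw is reproducible

population = ["Redmi", "Samsung", "Nokia", "Motorola", "Apple"]

# With replacement: a drawn element is conceptually returned to the
# population, so the same element may be drawn more than once.
draw = random.choices(population, k=5)
print(draw)
```

Because every draw happens against the full population, the result can contain repeats even when k equals the population size.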
Syntax:
sample(True, fraction, seed)
Here,
- fraction: The fraction of rows to generate, between 0.0 and 1.0 (inclusive). It is applied as a per-row probability, so the result is not guaranteed to contain exactly this fraction of rows.
- seed: The seed for the sampling (a random seed by default). Supplying the same seed regenerates the same sample.
Example:
Python3
# Python program to demonstrate random
# sampling in PySpark with replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create a session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() with replacement
df_mobile_brands = df.sample(True, 0.5, 42)

# Print to the console
df_mobile_brands.show()
Output:
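Note that fraction works as a per-row probability rather than an exact row count, so repeated runs with different seeds return slightly different numbers of rows. The same Bernoulli behaviour can be seen in plain Python (an illustration of the statistics, not PySpark itself):

```python
import random

rng = random.Random(42)
n, fraction = 1000, 0.5

# Keep each of n rows independently with probability `fraction`,
# repeated over five trials.
counts = [sum(rng.random() < fraction for _ in range(n)) for _ in range(5)]
print(counts)  # each trial is close to, but rarely exactly, 500
```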
Method 2: Random sampling without replacement
Random sampling without replacement is a type of random sampling in which each element has only one chance of being drawn into the sample.
Syntax:
sample(False, fraction, seed)
Here,
- fraction: The fraction of rows to generate, between 0.0 and 1.0 (inclusive). It is applied as a per-row probability, so the result is not guaranteed to contain exactly this fraction of rows.
- seed: The seed for the sampling (a random seed by default). Supplying the same seed regenerates the same sample.
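Conceptually, drawing without replacement means no element can appear twice in the result; in plain Python the same idea looks like this (illustrative only, not the PySpark API):

```python
import random

random.seed(42)

population = ["Redmi", "Samsung", "Nokia", "Motorola", "Apple"]

# Without replacement: each element can be drawn at most once,
# so the result never contains duplicates.
draw = random.sample(population, k=3)
print(draw)
```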
Example:
Python3
# Python program to demonstrate random
# sampling in PySpark without replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() without replacement
df_mobile_brands = df.sample(False, 0.5, 42)

# Print to the console
df_mobile_brands.show()
Output:
Method 3: Stratified sampling in PySpark
In stratified sampling, the members of the population are divided into homogeneous subgroups called strata, and a sample is drawn from each stratum. Stratified sampling in PySpark can be performed with the sampleBy() function, whose syntax is given below.
Syntax:
sampleBy(column, fractions, seed=None)
Here,
- column: The column that defines the strata.
- fractions: The sampling fraction for each stratum. If a stratum is not listed, its fraction is treated as zero.
- seed: The seed for the sampling (a random seed by default). Supplying the same seed regenerates the same sample.
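Roughly speaking, sampleBy() behaves like an independent Bernoulli trial per row, with the keep probability looked up from the row's stratum. A plain-Python sketch of that idea, under that assumption (the sample_by helper and its rows are hypothetical, not the PySpark API):

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # Keep each row with the probability assigned to its stratum;
    # strata missing from `fractions` default to probability 0.0.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[key], 0.0)]

rows = [
    {"Brand": "Redmi", "Units": 1000000},
    {"Brand": "Nokia", "Units": 400000},
    {"Brand": "Apple", "Units": 2000000},
    {"Brand": "OPPO", "Units": 300000},  # stratum absent from fractions
]

picked = sample_by(rows, "Units", {1000000: 0.2, 2000000: 0.4, 400000: 0.2}, seed=0)

# A row whose stratum has fraction 0 can never be selected.
assert all(r["Units"] != 300000 for r in picked)
```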
Example:
In this example there are three strata, 1000000, 2000000 and 400000, sampled with the fractions 0.2, 0.4 and 0.2 respectively.
Python3
# Python program to demonstrate stratified sampling in PySpark

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=400000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="OPPO", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sampleBy() with one fraction per stratum of "Units"
mobile_brands = df.sampleBy("Units", fractions={
    1000000: 0.2, 2000000: 0.4, 400000: 0.2}, seed=0)

# Print to the console
mobile_brands.show()
Output: