Simple Random Sampling and Stratified Sampling in PySpark
In this article, we will discuss simple random sampling and stratified sampling in PySpark.
Simple random sampling:
In simple random sampling, elements are not obtained in any particular order; they are drawn at random, which is why every element is equally likely to be selected. In short, random sampling is the process of selecting a random subset of a large dataset. Simple random sampling in PySpark is performed with the sample() function, and it comes in two types: with replacement and without replacement. Both are discussed in detail below.
Method 1: Random sampling with replacement
Random sampling with replacement is a type of random sampling in which a previously selected element is returned to the population, so it may be drawn again in a later draw.
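The idea can be illustrated with Python's standard library before turning to PySpark (this is a plain-Python sketch of the concept, not the PySpark API; the population list is made up for illustration):

```python
import random

random.seed(42)  # fix the seed so the draw is reproducible

population = ["Redmi", "Samsung", "Nokia", "Motorola", "Apple"]

# With replacement: a drawn element is conceptually returned to the
# population, so the same element may be drawn more than once.
draw = random.choices(population, k=5)
print(draw)
```

Because every draw happens against the full population, the result can contain repeats even when k equals the population size.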
Syntax:
sample(True, fraction, seed)
Here,
- fraction: The fraction of rows to generate, between 0.0 and 1.0 (inclusive). It is applied as a per-row probability, so the result is not guaranteed to contain exactly this fraction of rows.
- seed: The seed for the sampling (a random seed by default). Supplying the same seed regenerates the same sample.
Example:
Python3
# Python program to demonstrate random
# sampling in PySpark with replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create a session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() with replacement
df_mobile_brands = df.sample(True, 0.5, 42)

# Print to the console
df_mobile_brands.show()
Output:
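Note that fraction works as a per-row probability rather than an exact row count, so repeated runs with different seeds return slightly different numbers of rows. The same Bernoulli behaviour can be seen in plain Python (an illustration of the statistics, not PySpark itself):

```python
import random

rng = random.Random(42)
n, fraction = 1000, 0.5

# Keep each of n rows independently with probability `fraction`,
# repeated over five trials.
counts = [sum(rng.random() < fraction for _ in range(n)) for _ in range(5)]
print(counts)  # each trial is close to, but rarely exactly, 500
```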
Method 2: Random sampling without replacement
Random sampling without replacement is a type of random sampling in which each element has only one chance of being drawn into the sample.
Syntax:
sample(False, fraction, seed)
Here,
- fraction: The fraction of rows to generate, between 0.0 and 1.0 (inclusive). It is applied as a per-row probability, so the result is not guaranteed to contain exactly this fraction of rows.
- seed: The seed for the sampling (a random seed by default). Supplying the same seed regenerates the same sample.
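Conceptually, drawing without replacement means no element can appear twice in the result; in plain Python the same idea looks like this (illustrative only, not the PySpark API):

```python
import random

random.seed(42)

population = ["Redmi", "Samsung", "Nokia", "Motorola", "Apple"]

# Without replacement: each element can be drawn at most once,
# so the result never contains duplicates.
draw = random.sample(population, k=3)
print(draw)
```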
Example:
Python3
# Python program to demonstrate random
# sampling in PySpark without replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() without replacement
df_mobile_brands = df.sample(False, 0.5, 42)

# Print to the console
df_mobile_brands.show()
Output:
Method 3: Stratified sampling in PySpark
In stratified sampling, the members of the population are divided into homogeneous subgroups called strata, and a sample is drawn from each stratum. Stratified sampling in PySpark can be performed with the sampleBy() function, whose syntax is given below.
Syntax:
sampleBy(column, fractions, seed=None)
Here,
- column: The column that defines the strata.
- fractions: The sampling fraction for each stratum. If a stratum is not listed, its fraction is treated as zero.
- seed: The seed for the sampling (a random seed by default). Supplying the same seed regenerates the same sample.
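Roughly speaking, sampleBy() behaves like an independent Bernoulli trial per row, with the keep probability looked up from the row's stratum. A plain-Python sketch of that idea, under that assumption (the sample_by helper and its rows are hypothetical, not the PySpark API):

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # Keep each row with the probability assigned to its stratum;
    # strata missing from `fractions` default to probability 0.0.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(r[key], 0.0)]

rows = [
    {"Brand": "Redmi", "Units": 1000000},
    {"Brand": "Nokia", "Units": 400000},
    {"Brand": "Apple", "Units": 2000000},
    {"Brand": "OPPO", "Units": 300000},  # stratum absent from fractions
]

picked = sample_by(rows, "Units", {1000000: 0.2, 2000000: 0.4, 400000: 0.2}, seed=0)

# A row whose stratum has fraction 0 can never be selected.
assert all(r["Units"] != 300000 for r in picked)
```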
Example:
In this example there are three strata, 1000000, 2000000 and 400000, sampled with the fractions 0.2, 0.4 and 0.2 respectively.
Python3
# Python program to demonstrate stratified sampling in PySpark

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create a dataframe from a list of Rows
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=400000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="OPPO", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sampleBy() with one fraction per stratum of "Units"
mobile_brands = df.sampleBy("Units", fractions={
    1000000: 0.2, 2000000: 0.4, 400000: 0.2}, seed=0)

# Print to the console
mobile_brands.show()
Output: