How to get random rows from a PySpark DataFrame?
In this article, we will learn how to get random rows from a PySpark DataFrame using the Python programming language.
Method 1: PySpark sample() method
PySpark provides various sampling methods that return a sample of rows from a given PySpark DataFrame.
Here are the details of the sample() method:
Syntax : DataFrame.sample(withReplacement, fraction, seed)
It returns a subset of the DataFrame.
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
fraction : float, optional
Fraction of rows to generate
seed : int, optional
Used to reproduce the same random sampling.
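As a quick illustration of the seed parameter (a minimal sketch, assuming a PySpark DataFrame named df already exists), passing the same seed produces the same sample on every run:
Python
# Assumption: df is an existing PySpark DataFrame.
# The same seed always yields the same sample rows.
sample_a = df.sample(withReplacement=False, fraction=0.5, seed=42)
sample_b = df.sample(withReplacement=False, fraction=0.5, seed=42)
# sample_a and sample_b contain identical rows
sample_a.show()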
Example:
In this example, we need to pass in the fraction, a float in the range [0.0, 1.0]. Using the formula:
Number of rows needed = Fraction * Total number of rows
we can say that the fraction we need is 1 / (total number of rows); for our 4-row DataFrame, that is 1/4 = 0.25.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
'Random_Row_Session'
).getOrCreate()
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)
# Printing the DataFrame
df.show()
# Taking a sample of df and storing it in df2.
# Note that the second argument is the fraction
# of the dataset we need (a float):
# number of rows = fraction * total number of rows
df2 = df.sample(False, 1.0 / df.count())
# printing the sample row which is a DataFrame
df2.show()
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+
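Because sample() draws each row independently with probability equal to the fraction, it may return zero rows or more than one. When exactly one random row is required, one common alternative (a sketch, not part of the methods in this article) is to order the DataFrame by a random column and keep the first row:
Python
from pyspark.sql.functions import rand

# Ordering by a random value and limiting to 1
# guarantees exactly one randomly chosen row
df_one = df.orderBy(rand()).limit(1)
df_one.show()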
Method 2: Using the takeSample() method
We first convert the PySpark DataFrame to an RDD. Resilient Distributed Datasets (RDDs) are the most basic and fundamental data structure in PySpark. They are immutable collections of data of any type.
We can get the RDD of a DataFrame with DataFrame.rdd and then use the takeSample() method.
Syntax of takeSample() :
takeSample(withReplacement, num, seed=None)
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
num : int
the number of sample values
seed : int, optional
Used to reproduce the same random sampling.
Returns : A list of num sampled elements (Row objects, when called on a DataFrame's RDD).
Example: In this example, we use the takeSample() method on the RDD with num = 1 to get a Row object; num is the number of samples.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
'Random_Row_Session'
).getOrCreate()
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)
# Printing the DataFrame
df.show()
# Getting RDD object from the DataFrame
rdd = df.rdd
# Taking a single sample from the RDD
# by putting num = 1 in the takeSample() function
rdd_sample = rdd.takeSample(withReplacement=False, num=1)
print(rdd_sample)
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
[Row(Letters='c', Position=3)]
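Note that takeSample() returns a plain Python list of Row objects rather than a DataFrame. If a DataFrame is needed again, the list can be passed back to createDataFrame() (a minimal sketch reusing rdd_sample and the session from above):
Python
# Turning the sampled Row objects back
# into a single-row PySpark DataFrame
df_sample = random_row_session.createDataFrame(rdd_sample)
df_sample.show()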
Method 3: Converting the PySpark DataFrame to a Pandas DataFrame and using the sample() method
We can convert a PySpark DataFrame to a Pandas DataFrame with the toPandas() function. This method should only be used when the resulting Pandas DataFrame is expected to be small, since all of the data is loaded into the driver's memory. It is an experimental method.
We then use the sample() method of the Pandas library, which returns a random sample of items from an axis of the Pandas DataFrame.
Syntax : PandasDataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
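For instance, n selects an exact number of rows and random_state makes the draw reproducible (a hedged sketch, assuming an existing Pandas DataFrame named pdf):
Python
# Assumption: pdf is an existing Pandas DataFrame.
# Pick exactly 2 random rows, reproducibly.
two_rows = pdf.sample(n=2, random_state=42)
print(two_rows)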
Example:
In this example, we convert the PySpark DataFrame to a Pandas DataFrame and use the Pandas sample() function on it.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
'Random_Row_Session'
).getOrCreate()
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)
# Printing the DataFrame
df.show()
# Converting the DataFrame to
# a Pandas DataFrame and taking a sample row
pandas_random = df.toPandas().sample()
# Converting the sample into
# a PySpark DataFrame
df_random = random_row_session.createDataFrame(pandas_random)
# Showing our randomly selected row
df_random.show()
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+