How to get random rows from a PySpark DataFrame?
In this article, we will learn how to get random rows from a PySpark DataFrame using the Python programming language.
Method 1: PySpark sample() method
PySpark provides various sampling methods that return a sample of rows from a given PySpark DataFrame.
Here are the details of the sample() method:
Syntax : DataFrame.sample(withReplacement, fraction, seed)
It returns a subset of the DataFrame.
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
fraction : float, optional
Fraction of rows to generate
seed : int, optional
Used to reproduce the same random sampling.
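As a quick illustration of the seed parameter (a minimal sketch, assuming a PySpark DataFrame named df already exists), passing the same seed produces the same sample on every run:
Python
# Assumption: df is an existing PySpark DataFrame.
# The same seed always yields the same sample rows.
sample_a = df.sample(withReplacement=False, fraction=0.5, seed=42)
sample_b = df.sample(withReplacement=False, fraction=0.5, seed=42)
# sample_a and sample_b contain identical rows
sample_a.show()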
Example:
In this example, we need to pass in the fraction, a float in the range [0.0, 1.0]. Using the formula:
Number of rows needed = Fraction * Total number of rows
we can say that the fraction we need is 1 / (total number of rows); for our 4-row DataFrame, that is 1/4 = 0.25.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
'Random_Row_Session'
).getOrCreate()
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)
# Printing the DataFrame
df.show()
# Taking a sample of df and storing it in df2.
# Note that the second argument is the fraction
# of the dataset we need (a float):
# number of rows = fraction * total number of rows
df2 = df.sample(False, 1.0 / df.count())
# printing the sample row which is a DataFrame
df2.show()
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+
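Because sample() draws each row independently with probability equal to the fraction, it may return zero rows or more than one. When exactly one random row is required, one common alternative (a sketch, not part of the methods in this article) is to order the DataFrame by a random column and keep the first row:
Python
from pyspark.sql.functions import rand

# Ordering by a random value and limiting to 1
# guarantees exactly one randomly chosen row
df_one = df.orderBy(rand()).limit(1)
df_one.show()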
Method 2: Using the takeSample() method
We first convert the PySpark DataFrame to an RDD. Resilient Distributed Datasets (RDDs) are the most basic and fundamental data structure in PySpark. They are immutable collections of data of any type.
We can get the RDD of a DataFrame with DataFrame.rdd and then use the takeSample() method.
Syntax of takeSample() :
takeSample(withReplacement, num, seed=None)
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
num : int
the number of sample values
seed : int, optional
Used to reproduce the same random sampling.
Returns : A list of num sampled elements (Row objects, when called on a DataFrame's RDD).
Example: In this example, we use the takeSample() method on the RDD with num = 1 to get a Row object; num is the number of samples.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
'Random_Row_Session'
).getOrCreate()
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)
# Printing the DataFrame
df.show()
# Getting RDD object from the DataFrame
rdd = df.rdd
# Taking a single sample from the RDD
# by putting num = 1 in the takeSample() function
rdd_sample = rdd.takeSample(withReplacement=False, num=1)
print(rdd_sample)
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
[Row(Letters='c', Position=3)]
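Note that takeSample() returns a plain Python list of Row objects rather than a DataFrame. If a DataFrame is needed again, the list can be passed back to createDataFrame() (a minimal sketch reusing rdd_sample and the session from above):
Python
# Turning the sampled Row objects back
# into a single-row PySpark DataFrame
df_sample = random_row_session.createDataFrame(rdd_sample)
df_sample.show()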
Method 3: Converting the PySpark DataFrame to a Pandas DataFrame and using the sample() method
We can convert a PySpark DataFrame to a Pandas DataFrame with the toPandas() function. This method should only be used when the resulting Pandas DataFrame is expected to be small, since all of the data is loaded into the driver's memory. It is an experimental method.
We then use the sample() method of the Pandas library, which returns a random sample of items from an axis of the Pandas DataFrame.
Syntax : PandasDataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
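For instance, n selects an exact number of rows and random_state makes the draw reproducible (a hedged sketch, assuming an existing Pandas DataFrame named pdf):
Python
# Assumption: pdf is an existing Pandas DataFrame.
# Pick exactly 2 random rows, reproducibly.
two_rows = pdf.sample(n=2, random_state=42)
print(two_rows)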
Example:
In this example, we convert the PySpark DataFrame to a Pandas DataFrame and use the Pandas sample() function on it.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
'Random_Row_Session'
).getOrCreate()
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)
# Printing the DataFrame
df.show()
# Converting the DataFrame to
# a Pandas DataFrame and taking a sample row
pandas_random = df.toPandas().sample()
# Converting the sample into
# a PySpark DataFrame
df_random = random_row_session.createDataFrame(pandas_random)
# Showing our randomly selected row
df_random.show()
Output:
+-------+--------+
|Letters|Position|
+-------+--------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+--------+
+-------+--------+
|Letters|Position|
+-------+--------+
| b| 2|
+-------+--------+