Extracting the First N and Last N Rows from a PySpark DataFrame

Last modified: 2022-05-13 01:55:00.211000 Author: Mango

In this article, we will extract the first N rows and the last N rows from a DataFrame using PySpark in Python. To get started, we first create a sample DataFrame.

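We first create a Spark session through SparkSession.builder, giving the application a name with appName() and obtaining the session with the getOrCreate() method.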

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

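Finally, once the data (a list of rows) and the list of column names are ready, pass both to the createDataFrame() method: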

dataframe = spark.createDataFrame(data, columns)

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]
  
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
print('Actual data in dataframe')
dataframe.show()

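Output:

Actual data in dataframe
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+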

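Extracting the First N Rows

The first N rows can be extracted in several ways, each discussed below with an example:

Method 1: Using head()

This function extracts the first N rows of the given DataFrame and returns them as a list of Row objects.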

Syntax: dataframe.head(n)

where,

n is the number of rows to extract, counting from the first row
dataframe is the DataFrame created from the nested lists using PySpark

Python3

print("Top 2 rows ")
  
# extract top 2 rows
a = dataframe.head(2)
print(a)
  
print("Top 1 row ")
  
# extract top 1 row
a = dataframe.head(1)
print(a)

Output:

Top 2 rows

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

Top 1 row

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]

Method 2: Using first()

This function returns only the first row of the DataFrame, as a single Row object rather than a list.

Syntax: dataframe.first()

It does not take any parameters
dataframe is the DataFrame created from the nested lists using PySpark

Python3

print("Top row ")
  
# extract top  row
a = dataframe.first()
print(a)

Output:

Top row

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Method 3: Using show()

show() prints the rows of the DataFrame as a table, starting from the top.

Syntax: dataframe.show(n)

where,

dataframe is the input DataFrame
n is the number of rows to display from the top; if n is not specified, show() prints the first 20 rows

Python3

# show() function to get 
# 2 rows
dataframe.show(2)

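Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
+-----------+-------------+------------+
only showing top 2 rows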

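Extracting the Last N Rows

Extracting the last N rows means retrieving the final N rows of the given DataFrame. The tail() function, available since Spark 3.0, does this and returns the rows as a list of Row objects.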

Syntax: dataframe.tail(n)

where,

n is the number of rows to return, counting back from the last row
dataframe is the input DataFrame

Example:

Python3

print("Last 2 rows ")
  
# extract last 2 rows
a = dataframe.tail(2)
print(a)
  
print("Last 1 row ")
  
# extract last 1 row
a = dataframe.tail(1)
print(a)

Output:

Last 2 rows

[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
 Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

Last 1 row

[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
