Extracting the First N and Last N Rows from a PySpark DataFrame
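In this article, we will extract the first N rows and the last N rows from a DataFrame using PySpark in Python. To begin, we create a sample DataFrame.
We first build a Spark session object through SparkSession.builder, set the application name with appName(), and call getOrCreate() to obtain the session.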
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
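Then, once the data is defined as a list of rows and the column names as a list, pass both to the createDataFrame() method: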
dataframe = spark.createDataFrame(data, columns)
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
dataframe.show()
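Extracting the first N rows
The first N rows can be extracted in several ways, each discussed below with an example:
Method 1: Using head()
This function extracts the first N rows of the given DataFrame and returns them as a list of Row objects.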
Syntax: dataframe.head(n)
where,
- n specifies the number of rows to extract, counting from the top
- dataframe is the dataframe name created from the nested lists using pyspark.
Python3
print("Top 2 rows ")
# extract top 2 rows
a = dataframe.head(2)
print(a)
print("Top 1 row ")
# extract top 1 row
a = dataframe.head(1)
print(a)
Output:
Top 2 rows
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
Top 1 row
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
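Method 2: Using first()
This function extracts only the first row of the DataFrame and returns it as a single Row object rather than a list.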
Syntax: dataframe.first()
- first() does not take any parameters
- dataframe is the dataframe name created from the nested lists using pyspark
Python3
print("Top row ")
# extract top row
a = dataframe.first()
print(a)
Output:
Top row
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
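Method 3: Using show()
This function prints the DataFrame as a table, starting from the top row. It displays the rows rather than returning them.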
Syntax: dataframe.show(n)
where,
- dataframe is the input dataframe
- n is the number of rows to display from the top; if n is not specified, the first 20 rows are displayed
Python3
# show() function to get
# 2 rows
dataframe.show(2)
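Output: a table containing only the first two rows (employee IDs 1 and 2), followed by a note that only the top 2 rows are shown.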
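Extracting the last N rows
Extracting the last N rows means fetching the final N rows of the given DataFrame. For this we use the tail() function, which returns the last N rows as a list of Row objects. tail() is available in Spark 3.0 and later.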
Syntax: dataframe.tail(n)
where,
- n is the number of rows to return, counting from the end
- dataframe is the input dataframe
Example:
Python3
print("Last 2 rows ")
# extract last 2 rows
a = dataframe.tail(2)
print(a)
print("Last 1 row ")
# extract last 1 row
a = dataframe.tail(1)
print(a)
Output:
Last 2 rows
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
Last 1 row
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]