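How to get distinct rows in a dataframe using PySpark?
In this article, we will get the distinct rows from a PySpark dataframe in Python. We will create the dataframe from a nested list and then extract its distinct rows.
We create the dataframe by passing the list to the createDataFrame() method, and then call the distinct() function to get the distinct rows from the dataframe.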
Syntax: dataframe.distinct()
where dataframe is the name of the dataframe created from the nested lists using PySpark. distinct() returns a new dataframe containing only the unique rows.
Example 1: Python code to get distinct data from college data in a dataframe created from a list of lists.
Python3
# importing module
import pyspark
# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving
# an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of college data
data = [["1", "bobby", "vvit"],
        ["2", "sravan", "jntuk"],
        ["3", "rohith", "AU"],
        ["4", "sridevi", "GVRS"],
        ["1", "bobby", "vvit"]]
# specify column names
columns = ['ID', 'NAME', 'COLLEGE']
# creating a dataframe from the
# lists of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
dataframe.show()
Output:
Now get the distinct rows in the dataframe:
Python3
print('distinct data')
# display distinct data
dataframe.distinct().show()
Output:
Example 2: Python program to find distinct values in a dataframe with a single row
Python3
# importing module
import pyspark
# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving
# an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of college data
data = [["1", "bobby", "vvit"]]
# specify column names
columns = ['ID', 'NAME', 'COLLEGE']
# creating a dataframe from the
# list of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
dataframe.show()
Output:
Now get the distinct rows in the dataframe:
Python3
print('distinct data')
# display distinct data from
# the dataframe
dataframe.distinct().show()
Output: