在 PySpark 数据框中显示不同的列值

在本文中，我们将使用Python的pyspark 显示来自数据帧的不同列值。为此，我们使用了 distinct() 和 dropDuplicates() 函数以及 select()函数。

让我们创建一个示例数据框。

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data with 5 row values
data =[["1", "sravan", "company 1"],
       ["3", "bobby", "company 3"],
       ["2", "ojaswi", "company 2"],
       ["1", "sravan", "company 1"],
       ["3", "bobby", "company 3"],
       ["4", "rohith", "company 2"],
       ["5", "gnanesh", "company 1"]]
  
# specify column names
columns = ['Employee ID','Employee NAME','Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
  
dataframe.show()

Python3

# select first column to get
# unique data using distinct function()
dataframe.select("Employee ID").distinct().show()

Python3

# select first and second column
# to get unique data using distinct function()
dataframe.select(["Employee ID",
                  "Employee NAME"]).distinct().show()

Python3

#select first  column to get
# unique data using dropDuplicates function()
dataframe.select("Employee ID").dropDuplicates().show()

Python3

#select first and second column
# to get unique data using dropDuplicates function()
dataframe.select(["Employee ID",
                  "Employee NAME"]).dropDuplicates().show()

输出：

方法一：使用distinct()

此函数使用 distinct()函数从列返回不同的值。

Syntax: dataframe.select(“column_name”).distinct().show()

编程需要懂一点英语

示例 1：对于单个列。

蟒蛇3

# select first column to get
# unique data using distinct function()
dataframe.select("Employee ID").distinct().show()

输出：

+-----------+
|Employee ID|
+-----------+
|          3|
|          5|
|          1|
|          4|
|          2|
+-----------+

示例 2：对于多列。

使用 distinct()函数显示 2 列中的唯一数据的Python代码。

Syntax: dataframe.select(“column_name 1, column_name 2 “).distinct().show()

编程需要懂一点英语

代码：

蟒蛇3

# select first and second column
# to get unique data using distinct function()
dataframe.select(["Employee ID",
                  "Employee NAME"]).distinct().show()

输出：

+-----------+-------------+
|Employee ID|Employee NAME|
+-----------+-------------+
|          5|      gnanesh|
|          4|       rohith|
|          1|       sravan|
|          2|       ojaswi|
|          3|        bobby|
+-----------+-------------+

方法 2：使用 dropDuplicates()

此函数使用 dropDuplicates()函数在数据框中的一列中显示唯一数据。

Syntax: dataframe.select(“column_name”).dropDuplicates().show()

编程需要懂一点英语

示例 1：对于单列。

蟒蛇3

#select first  column to get
# unique data using dropDuplicates function()
dataframe.select("Employee ID").dropDuplicates().show()

输出：

+-----------+
|Employee ID|
+-----------+
|          3|
|          5|
|          1|
|          4|
|          2|
+-----------+

示例 2：对于多列

使用 dropDuplicates()函数显示来自 2 列的唯一数据的Python代码

蟒蛇3

#select first and second column
# to get unique data using dropDuplicates function()
dataframe.select(["Employee ID",
                  "Employee NAME"]).dropDuplicates().show()

输出：

+-----------+-------------+
|Employee ID|Employee NAME|
+-----------+-------------+
|          5|      gnanesh|
|          4|       rohith|
|          1|       sravan|
|          2|       ojaswi|
|          3|        bobby|
+-----------+-------------+