📜  How to count unique IDs after groupBy in a PySpark DataFrame?

📅  Last modified: 2022-05-13 01:55:32.473000             🧑  Author: Mango


In this article, we will discuss how to count unique IDs after a groupBy in a PySpark DataFrame.

For this, we will use two different methods:

  • Using the distinct().count() method.
  • Using an SQL query.

But first, let's create the DataFrame for demonstration:

Python3
# importing module
import pyspark
  
# importing SparkSession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of students' data
data = [["1", "sravan", "vignan", 95],
        ["2", "ojaswi", "vvit", 78],
        ["3", "rohith", "vvit", 89],
        ["2", "ojaswi", "vvit", 100],
        ["4", "sridevi", "vignan", 88],
        ["1", "sravan", "vignan", 78],
        ["4", "sridevi", "vignan", 90],
        ["5", "gnanesh", "iit", 67]]
  
# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject marks']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
print("the data is ")
dataframe.show()


Output:

the data is 
+----------+------------+-------+-------------+
|student ID|student NAME|college|subject marks|
+----------+------------+-------+-------------+
|         1|      sravan| vignan|           95|
|         2|      ojaswi|   vvit|           78|
|         3|      rohith|   vvit|           89|
|         2|      ojaswi|   vvit|          100|
|         4|     sridevi| vignan|           88|
|         1|      sravan| vignan|           78|
|         4|     sridevi| vignan|           90|
|         5|     gnanesh|    iit|           67|
+----------+------------+-------+-------------+

Method 1: Using the groupBy() and distinct().count() methods

groupBy(): Used to group the data based on column names.

distinct().count(): Used to count the distinct rows of the DataFrame (distinct().show() displays them).
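
Note: the two steps can also be fused into a single aggregation with pyspark.sql.functions.countDistinct. The following is a minimal sketch, assuming the freshly created dataframe from above; it is an alternative to the distinct().count() approach used in the examples below, not the method this article demonstrates:

Python3

# count distinct student IDs in one aggregation,
# without a separate distinct() pass
from pyspark.sql.functions import countDistinct

dataframe.agg(countDistinct('student ID')).show()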

Example 1:

Python3



# group by student ID and sum the marks
dataframe = dataframe.groupBy(
  'student ID').sum('subject marks')

# display the count of unique IDs
print("Unique ID count after Group By : ",
      dataframe.distinct().count())

print("the data is ")

# display the rows with unique IDs
dataframe.distinct().show()

Output:

Unique ID count after Group By :  5
the data is 
+----------+------------------+
|student ID|sum(subject marks)|
+----------+------------------+
|         3|                89|
|         5|                67|
|         1|               173|
|         4|               178|
|         2|               178|
+----------+------------------+

Example 2: Count and display the unique IDs of a single column:

Python3

# note: this example starts again from the freshly
# created dataframe, not the grouped one above
# group by student ID and sum the marks
dataframe = dataframe.groupBy(
  'student ID').sum('subject marks')

# display the count of unique IDs
print("Unique ID count after Group By : ",
      dataframe.distinct().count())

print("the data is ")

# display the values of the unique ID column only
dataframe.select('student ID').distinct().show()

Output:

Unique ID count after Group By :  5
the data is 
+----------+
|student ID|
+----------+
|         3|
|         5|
|         1|
|         4|
|         2|
+----------+
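
For very large DataFrames, an exact distinct count can be expensive. Spark also provides pyspark.sql.functions.approx_count_distinct, which trades a small, bounded error for speed; a minimal sketch, assuming the grouped dataframe from the examples above:

Python3

# approximate distinct count (HyperLogLog-based);
# rsd is the maximum relative standard deviation
# allowed for the estimate
from pyspark.sql.functions import approx_count_distinct

dataframe.agg(
    approx_count_distinct('student ID', rsd=0.05)).show()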

Method 2: Using an SQL query

We can get the unique ID count by using spark.sql().

Syntax: spark.sql("sql query").show()

Example:

Python3

# importing module
import pyspark
  
# importing SparkSession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of students' data
data = [["1", "sravan", "vignan", 95],
        ["2", "ojaswi", "vvit", 78],
        ["3", "rohith", "vvit", 89],
        ["2", "ojaswi", "vvit", 100],
        ["4", "sridevi", "vignan", 88],
        ["1", "sravan", "vignan", 78],
        ["4", "sridevi", "vignan", 90],
        ["5", "gnanesh", "iit", 67]]
  
# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject marks']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# group by student ID and sum the marks
dataframe = dataframe.groupBy('student ID').sum('subject marks')
  
# create a temp view named "DATA" for the
# above dataframe
dataframe.createOrReplaceTempView("DATA")
  
# count unique IDs with an SQL query; backticks are
# needed because the column name contains a space
spark.sql(
    "SELECT COUNT(DISTINCT `student ID`) FROM DATA").show()

Output:

+--------------------------+
|count(DISTINCT student ID)|
+--------------------------+
|                         5|
+--------------------------+
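
Note that the earlier groupBy is not required just to obtain the count itself. As a rough sketch, the same figure can be read straight off the ungrouped data; here RAW_DATA is a hypothetical view name chosen for this example, and data, columns, and spark are the objects defined in the program above:

Python3

# register the ungrouped dataframe under a
# hypothetical view name
raw = spark.createDataFrame(data, columns)
raw.createOrReplaceTempView("RAW_DATA")

# backticks escape the space in the column name
spark.sql(
    "SELECT COUNT(DISTINCT `student ID`) FROM RAW_DATA").show()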