How to count unique IDs after groupBy in a PySpark DataFrame?
In this article, we will discuss how to count unique IDs after a groupBy in a PySpark DataFrame.
For this, we will use two different methods:
- Using the distinct().count() method.
- Using an SQL query.
But first, let's create a DataFrame for demonstration:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan", 95],
        ["2", "ojaswi", "vvit", 78],
        ["3", "rohith", "vvit", 89],
        ["2", "ojaswi", "vvit", 100],
        ["4", "sridevi", "vignan", 88],
        ["1", "sravan", "vignan", 78],
        ["4", "sridevi", "vignan", 90],
        ["5", "gnanesh", "iit", 67]]

# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject marks']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

print("the data is ")
dataframe.show()
Output:
the data is 
+----------+------------+-------+-------------+
|student ID|student NAME|college|subject marks|
+----------+------------+-------+-------------+
|         1|      sravan| vignan|           95|
|         2|      ojaswi|   vvit|           78|
|         3|      rohith|   vvit|           89|
|         2|      ojaswi|   vvit|          100|
|         4|     sridevi| vignan|           88|
|         1|      sravan| vignan|           78|
|         4|     sridevi| vignan|           90|
|         5|     gnanesh|    iit|           67|
+----------+------------+-------+-------------+
Method 1: Using groupBy() and the distinct().count() method
groupBy(): used to group the data based on column names
Syntax: dataframe = dataframe.groupBy('column_name1').sum('column_name2')
distinct().count(): used to count and display the distinct rows of the dataframe
Syntax: dataframe.distinct().count()
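As a side note (an addition beyond the original walkthrough), PySpark's countDistinct function can produce the same unique-ID count in a single aggregation, without materializing the distinct rows first; a minimal sketch, assuming the dataframe created in the setup block above:
Python3
# minimal sketch: count distinct student IDs in one aggregation
# (assumes the `dataframe` created in the setup block above)
from pyspark.sql.functions import countDistinct

# returns a one-row dataframe holding the number of unique IDs
dataframe.select(countDistinct('student ID')).show()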
Example 1:
Python3
# group by student ID and sum the subject marks
dataframe = dataframe.groupBy(
    'student ID').sum('subject marks')

# display the count of unique IDs
print("Unique ID count after Group By : ",
      dataframe.distinct().count())

print("the data is ")

# display the rows of the unique IDs
dataframe.distinct().show()
Output:
Unique ID count after Group By : 5
the data is
+----------+------------------+
|student ID|sum(subject marks)|
+----------+------------------+
| 3| 89|
| 5| 67|
| 1| 173|
| 4| 178|
| 2| 178|
+----------+------------------+
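The auto-generated column name sum(subject marks) can be awkward to work with downstream; as a sketch of an alternative (our addition, with the alias 'total marks' chosen purely for illustration), the same grouping can be written with agg and an explicit alias, assuming a fresh dataframe from the setup block:
Python3
# minimal sketch: the same grouping with an explicit column alias
# (assumes a fresh `dataframe` from the setup block; the alias
# 'total marks' is an illustrative choice)
import pyspark.sql.functions as F

grouped = dataframe.groupBy('student ID').agg(
    F.sum('subject marks').alias('total marks'))
grouped.show()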
Example 2: Count and display the unique IDs of a single column:
Python3
# group by student ID and sum the subject marks
dataframe = dataframe.groupBy(
    'student ID').sum('subject marks')

# display the count of unique IDs
print("Unique ID count after Group By : ",
      dataframe.distinct().count())

print("the data is ")

# display the values of the unique IDs
dataframe.select('student ID').distinct().show()
Output:
Unique ID count after Group By : 5
the data is
+----------+
|student ID|
+----------+
| 3|
| 5|
| 1|
| 4|
| 2|
+----------+
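A closely related alternative (again our addition, not from the original article) is dropDuplicates, which keeps one entire row per student ID rather than just the ID column; a minimal sketch under the same assumption of a fresh dataframe:
Python3
# minimal sketch: keep one whole row per unique student ID
# (assumes a fresh `dataframe` from the setup block)
dataframe.dropDuplicates(['student ID']).show()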
Method 2: Using an SQL query
We can get the unique ID count by using spark.sql.
Syntax:
spark.sql("sql query").show()
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan", 95],
        ["2", "ojaswi", "vvit", 78],
        ["3", "rohith", "vvit", 89],
        ["2", "ojaswi", "vvit", 100],
        ["4", "sridevi", "vignan", 88],
        ["1", "sravan", "vignan", 78],
        ["4", "sridevi", "vignan", 90],
        ["5", "gnanesh", "iit", 67]]

# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject marks']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# group by student ID and sum the subject marks
dataframe = dataframe.groupBy('student ID').sum('subject marks')

# create a view for the above dataframe with
# the view name "DATA"
dataframe.createOrReplaceTempView("DATA")

# count the unique IDs with an SQL query; backticks are needed
# because the column name contains a space
spark.sql("SELECT COUNT(DISTINCT `student ID`) FROM DATA").show()
Output:
+--------------------------+
|count(DISTINCT student ID)|
+--------------------------+
|                         5|
+--------------------------+
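For very large dataframes an exact distinct count can be expensive; as a final aside beyond the original article, PySpark also provides approx_count_distinct, which trades a small, bounded error for speed. A minimal sketch, assuming the grouped dataframe from the example above:
Python3
# minimal sketch: approximate distinct count for large data
# (our addition; `rsd` caps the relative standard deviation of the estimate)
from pyspark.sql.functions import approx_count_distinct

dataframe.select(
    approx_count_distinct('student ID', rsd=0.05)).show()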