Drop rows in a PySpark DataFrame based on conditions
In this article, we are going to drop rows from a PySpark DataFrame. We will cover the most common cases, such as dropping rows with null values and dropping duplicate rows. Each of these cases uses a different function, and we will discuss them in detail.
We will cover the following topics:
- Drop rows with a condition using the where() and filter() functions.
- Drop rows with NA or missing values.
- Drop rows with null values.
- Drop duplicate rows.
- Drop duplicate rows based on a column.
Creating a DataFrame for demonstration:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["6", "ravi", "vrs"],
        ["5", "gnanesh", "iit"]]

# specify column names
columns = ['ID', 'NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Drop rows with condition using where() and filter() functions
Here we will drop rows based on a condition using the where() and filter() functions.
where(): This function checks the given condition and returns only the rows that satisfy it, effectively dropping the rows that do not.
Syntax: dataframe.where(condition)
filter(): This function works the same way, keeping the rows that satisfy the condition and dropping the rest; in PySpark it is an alias for where().
Syntax: dataframe.filter(condition)
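Both functions accept any boolean Column expression, so several conditions can be combined with & (and), | (or), and ~ (not), as long as each condition is wrapped in parentheses. A minimal sketch (not part of the original examples) using the student DataFrame created above:
Python3
# keep rows whose college is either 'vvit' or 'iit';
# each condition must be parenthesized before combining
dataframe.where((dataframe.college == 'vvit') | (dataframe.college == 'iit')).show()

# filter() is an alias for where(), so this is equivalent
dataframe.filter((dataframe.college == 'vvit') | (dataframe.college == 'iit')).show()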
Example 1: Using where()
Python program to drop rows whose ID is 4 or less
Python3
# drop rows with ID of 4 or less by keeping only rows with ID > 4
dataframe.where(dataframe.ID > 4).show()
Output:
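Note that ID was created as a string column, so the comparison above relies on Spark's implicit casting. A more explicit variant (a sketch, not from the original example) casts the column to an integer first:
Python3
# cast the string ID column to an integer before the numeric comparison
dataframe.where(dataframe.ID.cast('int') > 4).show()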
Drop rows where the college is 'vrs':
Python3
# drop rows whose college is 'vrs' by keeping all other rows
dataframe.where(dataframe.college != 'vrs').show()
Output:
Example 2: Using the filter() function
Python program to drop rows with ID equal to 4
Python3
# drop rows whose ID is '4'
dataframe.filter(dataframe.ID != '4').show()
Output:
Drop rows with NA or missing values using dropna()
NA values are the missing values in a DataFrame; they show up as null, and the dropna() method drops the rows that contain them.
Syntax: dataframe.dropna()
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 10 rows, some containing None
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        [None, "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", None],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", None, "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the actual dataframe
dataframe.show()

# drop rows with missing (null) values
dataframe = dataframe.dropna()

# display the dataframe after dropping null values
dataframe.show()
Output:
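Beyond the no-argument form used above, dropna() accepts optional parameters from the standard PySpark API. A brief sketch, run against the employee DataFrame before dropna is applied:
Python3
# drop a row only when ALL of its values are null
dataframe.dropna(how='all').show()

# keep only rows that have at least 2 non-null values
dataframe.dropna(thresh=2).show()

# drop rows that have a null in the 'Employee NAME' column specifically
dataframe.dropna(subset=['Employee NAME']).show()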
Drop rows with null values using isNotNull()
Here we drop the rows that have a null value in a particular column, using the isNotNull() function to keep only the non-null rows.
Syntax: dataframe.where(dataframe.column.isNotNull())
Python program to drop rows with nulls in a specific column
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 10 rows, some containing None
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        [None, "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", None],
        [None, "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", None, "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]

# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

# keep only the rows whose ID column is not null
dataframe.where(dataframe.ID.isNotNull()).show()
Output:
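The complementary isNull() method selects exactly the rows that would be dropped, which is useful for inspecting them first. A minimal sketch, also showing the equivalent col() style from pyspark.sql.functions:
Python3
from pyspark.sql.functions import col

# inspect the rows that WOULD be dropped: those with a null ID
dataframe.where(dataframe.ID.isNull()).show()

# the same non-null filter written with the col() helper
dataframe.filter(col('ID').isNotNull()).show()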
Removing duplicate rows
Duplicate rows are rows whose values are identical across the DataFrame; we remove them with the dropDuplicates() function.
Example 1: Python code to drop duplicate rows.
Syntax: dataframe.dropDuplicates()
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 10 rows, including duplicates
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["6", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]

# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

# remove the duplicate rows
dataframe.dropDuplicates().show()
Output:
Example 2: Dropping duplicates based on a column name.
Syntax: dataframe.dropDuplicates(['column_name'])
Python code to drop duplicates based on the employee name
Python3
# remove duplicates based on the Employee NAME column
dataframe.dropDuplicates(['Employee NAME']).show()
Output:
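dropDuplicates() also accepts several column names at once, in which case rows are deduplicated on the combination of those columns rather than a single one. A minimal sketch:
Python3
# drop duplicates based on the name and company combination
dataframe.dropDuplicates(['Employee NAME', 'Company Name']).show()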
Removing duplicate rows using the distinct() function
We can also remove duplicate rows with the distinct() function; it takes no arguments and considers all columns, just like dropDuplicates() with no column list.
Syntax: dataframe.distinct()
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 10 rows, including duplicates
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["6", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]

# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# remove the duplicates by using the distinct function
dataframe.distinct().show()
Output:
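Since distinct() and dropDuplicates() with no arguments are equivalent, either can be chained with count() when only the number of unique rows is needed. A minimal sketch; both lines return the same value:
Python3
# count the unique rows; both forms give the same result
print(dataframe.distinct().count())
print(dataframe.dropDuplicates().count())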