📜  Drop rows in PySpark DataFrame with condition

📅  Last modified: 2022-05-13 01:55:33.074000             🧑  Author: Mango

In this article, we will drop rows from a PySpark dataframe. We will cover the most common conditions, such as dropping rows with null values and dropping duplicate rows. These conditions use different functions, which we will discuss in detail.

We will cover the following topics:

  • Drop rows with a condition using the where() and filter() keywords
  • Drop rows with NA or missing values
  • Drop rows with null values
  • Drop duplicate rows
  • Drop duplicate rows based on a column

Creating a dataframe for demonstration:

Python3
# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student data
data = [["1", "sravan", "vignan"],
        ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"],
        ["4", "sridevi", "vignan"],
        ["6", "ravi", "vrs"],
        ["5", "gnanesh", "iit"]]
  
# specify column names
columns = ['ID', 'NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
print('Actual data in dataframe')
dataframe.show()



Output:

Drop rows with a condition using the where() and filter() functions

Here, we will drop rows based on a condition using the where() and filter() functions.

where(): checks the given condition and returns only the rows that satisfy it; rows that fail the condition are dropped.

filter(): behaves the same way; in PySpark, where() is an alias of filter(), so the two are interchangeable.

Example 1: Using where()

Python program to drop rows with an ID of 4 or less (keeping only rows with ID greater than 4)

Python3

# drop rows with an ID of 4 or less (keep only ID > 4)
dataframe.where(dataframe.ID > 4).show()

Output:
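
Note that the ID column holds strings, so the comparison above relies on Spark's implicit cast from string to a numeric type. A minimal sketch that makes the cast explicit, assuming the same demonstration dataframe:

Python3

# cast the string ID column to int before comparing
dataframe.where(dataframe.ID.cast('int') > 4).show()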

Drop rows where college is 'vrs':

Python3

# drop rows where college is 'vrs'
dataframe.where(dataframe.college != 'vrs').show()

Output:
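
One caveat: in Spark SQL, a comparison with a null value evaluates to null, and where() drops rows whose condition is null, so dataframe.college != 'vrs' would also drop any row whose college is null. A minimal sketch that keeps such rows, assuming the same dataframe:

Python3

# keep rows where college is not 'vrs', including rows with a null college
dataframe.where((dataframe.college != 'vrs') | dataframe.college.isNull()).show()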

Example 2: Using the filter() function

Python program to drop rows with ID equal to 4

Python3

# drop rows where ID is '4'
dataframe.filter(dataframe.ID != '4').show()

Output:
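
filter() (and where()) also accept a SQL expression string instead of a Column, and multiple Column conditions can be combined with the & (and) and | (or) operators, with each sub-condition in parentheses. A minimal sketch, assuming the same demonstration dataframe:

Python3

# the same filter expressed as a SQL string
dataframe.filter("ID != '4'").show()

# combine two conditions: college is 'vvit' and ID is not '2'
dataframe.filter((dataframe.college == 'vvit') & (dataframe.ID != '2')).show()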

Drop rows with NA values using dropna

NA values are the missing values in the dataframe; they show up as null. Using the dropna() method, we can filter out the rows that contain them.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data (10 rows, some with missing values)
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        [None, "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", None],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", None, "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]
  
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# display actual dataframe
dataframe.show()
  
# drop rows with missing values
dataframe = dataframe.dropna()

# display dataframe after dropping rows with null values
dataframe.show()

Output:
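
By default, dropna() drops a row if any of its columns is null; it also accepts how, thresh, and subset parameters to control this. A minimal sketch, assuming the employee dataframe is recreated first (the code above reassigned it to the already-cleaned result):

Python3

# drop a row only if ALL of its columns are null
dataframe.dropna(how='all').show()

# keep rows that have at least 2 non-null values
dataframe.dropna(thresh=2).show()

# only consider the 'Company Name' column when looking for nulls
dataframe.dropna(subset=['Company Name']).show()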

Drop rows with null values using isNotNull

Here, we drop rows that have a null value in a particular column, using the isNotNull() function to keep only the non-null rows.

Python program to drop null values based on a particular column

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data (10 rows, some with missing values)
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        [None, "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", None],
        [None, "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", None, "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]
  
# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
  
# removing null values in ID column
dataframe.where(dataframe.ID.isNotNull()).show()

Output:
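
The same filter can be written as a SQL expression string, or inverted with isNull() to inspect the rows being removed. A minimal sketch, assuming the same dataframe:

Python3

# equivalent filter as a SQL expression string
dataframe.filter("ID IS NOT NULL").show()

# show only the rows that have a null ID
dataframe.where(dataframe.ID.isNull()).show()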

Drop duplicate rows

Duplicate rows are rows whose values are identical in every column of the dataframe; we remove them with the dropDuplicates() function.

Example 1: Python code to drop duplicate rows.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data (10 rows, with duplicates)
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["6", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]
  
# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
  
# remove the duplicates
dataframe.dropDuplicates().show()

Output:
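
PySpark also exposes drop_duplicates() as a pandas-style alias of dropDuplicates(), so the following should produce the same result:

Python3

# drop_duplicates() is an alias of dropDuplicates()
dataframe.drop_duplicates().show()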

Example 2: Drop duplicates based on column names.

Python code to drop duplicates based on employee name

Python3

# remove duplicates based on the Employee NAME column
dataframe.dropDuplicates(['Employee NAME']).show()

Output:
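
dropDuplicates() also accepts several columns as the subset. Note that when duplicate rows differ outside the subset, Spark keeps an arbitrary one per group, so which row survives is not guaranteed. A minimal sketch, assuming the same dataframe:

Python3

# deduplicate on the combination of name and company
dataframe.dropDuplicates(['Employee NAME', 'Company Name']).show()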

Drop duplicate rows using the distinct() function

We can also drop duplicate rows using the distinct() function.

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data (10 rows, with duplicates)
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["6", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"]]
  
# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# remove the duplicates using the distinct() function
dataframe.distinct().show()

Output:
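
distinct() considers all columns and should behave the same as dropDuplicates() called with no arguments; a quick way to verify is to compare row counts before and after deduplication:

Python3

# compare row counts before and after deduplication
print(dataframe.count())             # total rows, including duplicates
print(dataframe.distinct().count())  # unique rows only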