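Delete one or more columns from a PySpark DataFrame
In this article, we discuss how to delete columns from a PySpark DataFrame.
In PySpark, the drop() function removes columns from a DataFrame, while the related na.drop() method removes rows that contain NULL values.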
Syntax for dropping rows that contain NULL values: dataframe_name.na.drop(how="any" or "all", thresh=threshold_value, subset=["column_name_1", "column_name_2"])
- how: takes either "any" or "all". With "any", a row is dropped if it contains a NULL in any column; with "all", a row is dropped only if every column is NULL. The default is "any".
- thresh: takes an integer and drops rows that have fewer than that many non-null values. The default is None. When thresh is given, it overrides how.
- subset: restricts the NULL check to the listed columns. The default is None, which checks all columns.
Python code to create an employee DataFrame with three columns:
Python3
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 7 rows
data = [["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["2", "ojaswi", "company 2"],
        ["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a DataFrame from the list of rows
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          2|       ojaswi|   company 2|
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+
Example 1: Delete a single column.
Here we delete a single column from the DataFrame.
Syntax: dataframe.drop('column name')
Code:
Python3
# delete single column
dataframe = dataframe.drop('Employee ID')
dataframe.show()
Output:
+-------------+------------+
|Employee NAME|Company Name|
+-------------+------------+
|       sravan|   company 1|
|        bobby|   company 3|
|       ojaswi|   company 2|
|       sravan|   company 1|
|        bobby|   company 3|
|       rohith|   company 2|
|      gnanesh|   company 1|
+-------------+------------+
Example 2: Delete multiple columns.
Here we delete multiple columns from the DataFrame.
Syntax: dataframe.drop(*('column 1', 'column 2', 'column n'))
Code:
Python3
# delete two columns
# (if 'Employee ID' was already removed in Example 1, drop() simply ignores it)
dataframe = dataframe.drop(*('Employee NAME',
                             'Employee ID'))
dataframe.show()
Output:
+------------+
|Company Name|
+------------+
|   company 1|
|   company 3|
|   company 2|
|   company 1|
|   company 3|
|   company 2|
|   company 1|
+------------+
Example 3: Delete all columns.
Here we delete every column from the DataFrame. To do this, we put the column names in a list and pass it to drop().
Python3
# avoid naming the variable "list", which would shadow the Python built-in
columns_to_drop = ['Employee ID', 'Employee NAME', 'Company Name']
# delete all columns
dataframe = dataframe.drop(*columns_to_drop)
dataframe.show()
Output:
++
||
++
||
||
||
||
||
||
||
++