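PySpark DataFrame – Select all columns except one or a set of columns
In this article, we will select all columns of a PySpark DataFrame except a single column or a set of columns. To do this, we will use the select() and drop() functions.
But first, let's create a DataFrame for the demonstration.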
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data = [["1", "sravan", "vignan"],
["2", "ojaswi", "vvit"],
["3", "rohith", "vvit"],
["4", "sridevi", "vignan"],
["1", "sravan", "vignan"],
["5", "gnanesh", "iit"]]
# specify column names
columns = ['student ID', 'student NAME', 'college']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
dataframe.show()
Output:
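Method 1: Using the drop() function
drop() returns a new DataFrame without the specified columns; the original DataFrame is not modified.
Syntax: dataframe.drop('column_names')
Where dataframe is the input dataframe and column_names are the columns to be dropped.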
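When the columns to exclude are already held in a Python list, the list can be unpacked into drop(). The following is a minimal sketch and not part of the original article: the helper name drop_columns is hypothetical, and dataframe is assumed to be a PySpark DataFrame such as the one created above.

```python
def drop_columns(dataframe, names):
    """Return a new DataFrame without the columns listed in `names`.

    `drop_columns` is a hypothetical helper for illustration. It relies only
    on DataFrame.drop(), which accepts one or more column names as separate
    string arguments, so the list is unpacked with `*`.
    """
    return dataframe.drop(*names)


# Usage with the dataframe created above (assumed to exist in the session):
# drop_columns(dataframe, ['student ID', 'college']).show()
```

Because drop() returns a new DataFrame, dataframe itself still has all three columns after the call.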
Example 1: Python program to select data by dropping one column
Python3
# drop student id
dataframe.drop('student ID').show()
Output:
Example 2: Python program to drop multiple columns (a set of columns)
Python3
# drop student id and college
dataframe.drop('student ID','college').show()
Output:
Method 2: Using the select() function
select() returns a new DataFrame containing only the specified columns.
Syntax: dataframe.select(columns)
Where dataframe is the input dataframe and columns are the columns to keep.
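Since select() keeps the columns it is given, selecting everything except some columns means building the list of names to keep first. The following is a minimal sketch of that idea, using the columns attribute of the DataFrame to get all column names. The helper names columns_except and select_except are hypothetical and not part of PySpark.

```python
def columns_except(all_columns, excluded):
    """Return the names in `all_columns` that are not in `excluded`.

    Hypothetical helper for illustration; the original order is preserved.
    """
    excluded = set(excluded)
    return [name for name in all_columns if name not in excluded]


def select_except(dataframe, excluded):
    """Select every column of `dataframe` except those named in `excluded`.

    Hypothetical helper: `dataframe.columns` lists the column names, and the
    names to keep are unpacked into select().
    """
    keep = columns_except(dataframe.columns, excluded)
    return dataframe.select(*keep)


# Column names of the demonstration dataframe created above
columns = ['student ID', 'student NAME', 'college']
print(columns_except(columns, ['student ID', 'college']))

# Usage with the dataframe created above (assumed to exist in the session):
# select_except(dataframe, ['student ID']).show()
```

For plain column removal this gives the same result as drop(); building the list explicitly is mainly useful when the columns to keep also need to be reordered or filtered by some rule.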
Example 1: Select a single column from the dataframe.
Python3
# select student id
dataframe.select('student ID').show()
Output:
Example 2: Python program to select two columns, student ID and student NAME
Python3
# select student id and student name
dataframe.select('student ID','student NAME').show()
Output: