PySpark - Select Columns From DataFrame
In this article, we will discuss how to select columns from a PySpark DataFrame. To do this, we will use the select() function.
Syntax: dataframe.select(parameter).show()
where,
- dataframe is the dataframe name
- parameter is the column(s) to be selected
- show() function is used to display the selected column
Let's create a sample DataFrame:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data = [["1", "sravan", "vignan"], ["2", "ojaswi", "vvit"],
["3", "rohith", "vvit"], ["4", "sridevi", "vignan"],
["1", "sravan", "vignan"], ["5", "gnanesh", "iit"]]
# specify column names
columns = ['student ID', 'student NAME', 'college']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
print("Actual data in dataframe")
# show dataframe
dataframe.show()
Selecting a single column
Passing a column name to select() returns that entire column of the DataFrame.
Syntax: dataframe.select("column_name").show()
Python3
# select column with column name
dataframe.select('student ID').show()
Selecting multiple columns
Passing several column names to select() returns all of those columns of the DataFrame.
Syntax: dataframe.select(["column_name1", "column_name2", "column_nameN"]).show()
Python3
# select multiple column with column name
dataframe.select(['student ID', 'student NAME', 'college']).show()
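Selecting by column number
Here, we will select columns by their column number. This can be done with the indexing operator: we pass the column number as an index to dataframe.columns[]. Indexing starts at 0, so column number 1 is the second column, 'student NAME'.
Syntax: dataframe.select(dataframe.columns[column_number]).show()
Python3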
# select column with column number 1
dataframe.select(dataframe.columns[1]).show()
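Since dataframe.columns is an ordinary Python list of column names, the same indexing idea can be extended to columns that are not next to each other. The sketch below uses a plain list as a stand-in for dataframe.columns so it runs without Spark; names_at is a hypothetical helper written for this example, not a PySpark function.

```python
# Stand-in for dataframe.columns, which is a plain Python list of column names
columns = ['student ID', 'student NAME', 'college']


def names_at(all_columns, positions):
    """Return the column names found at the given zero-based positions."""
    return [all_columns[i] for i in positions]


# Pick the first and third columns, skipping the one in between
picked = names_at(columns, [0, 2])
print(picked)  # ['student ID', 'college']

# With a real DataFrame, the resulting list could be passed to select():
# dataframe.select(names_at(dataframe.columns, [0, 2])).show()
```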
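We can also access multiple columns by column number. Here, we will use the slice operator to select them.
Syntax: dataframe.select(dataframe.columns[column_start:column_end]).show()
where column_start is the starting index (inclusive) and column_end is the ending index (exclusive)
Python3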
# select column with column number slice
# operator
dataframe.select(dataframe.columns[0:3]).show()
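The slice follows standard Python list rules, which is worth checking before passing it to select(). The following sketch uses a plain list as a stand-in for dataframe.columns (an assumption made so the example runs without Spark) to show that the end index is excluded and that an end index past the last column does not raise an error.

```python
# Stand-in for dataframe.columns, which is a plain Python list of column names
columns = ['student ID', 'student NAME', 'college']

# The end index is excluded: 0:2 yields positions 0 and 1 only
first_two = columns[0:2]
print(first_two)    # ['student ID', 'student NAME']

# 0:3 covers all three columns, as in the article's example
all_three = columns[0:3]
print(all_three)    # ['student ID', 'student NAME', 'college']

# Omitting the end index runs to the last column
from_second = columns[1:]
print(from_second)  # ['student NAME', 'college']

# An end index beyond the list length is clamped rather than raising IndexError
clamped = columns[0:10]
print(clamped)      # ['student ID', 'student NAME', 'college']
```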