Converting a PySpark DataFrame Column to a Python List

Last modified: 2022-05-13 01:54:19.414000             Author: Mango

In this article, we discuss how to convert a PySpark DataFrame column to a Python list.

Creating a DataFrame for demonstration:

Python3
# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]
  
# specify column names
columns = ['student ID', 'student NAME',
           'college', 'subject1', 'subject2']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# display dataframe
dataframe.show()


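Output:

+----------+------------+-------+--------+--------+
|student ID|student NAME|college|subject1|subject2|
+----------+------------+-------+--------+--------+
|         1|      sravan| vignan|      67|      89|
|         2|      ojaswi|   vvit|      78|      89|
|         3|      rohith|   vvit|     100|      80|
|         4|     sridevi| vignan|      78|      80|
|         1|      sravan| vignan|      89|      98|
|         5|     gnanesh|    iit|      94|      98|
+----------+------------+-------+--------+--------+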



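Method 1: Using flatMap()

This method selects the column, accesses the underlying RDD through the rdd attribute, and uses flatMap() to flatten each row into its values; collect() then returns those values as a Python list.

Example 1: Python code to convert a particular column to a list using flatMap

Python3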

# convert the student NAME column to a list using flatMap
print(dataframe.select('student NAME').
      rdd.flatMap(lambda x: x).collect())

# convert the student ID column to a list using flatMap
print(dataframe.select('student ID').
      rdd.flatMap(lambda x: x).collect())

Output:

['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
['1', '2', '3', '4', '1', '5']

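Example 2: Converting multiple columns to a list. The values of each row are flattened in order, so the resulting list interleaves the selected columns row by row.

Python3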

# convert multiple columns to a single list using flatMap
print(dataframe.select(['student NAME',
                        'student NAME',
                        'college']).
      rdd.flatMap(lambda x: x).collect())

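Output:

['sravan', 'sravan', 'vignan', 'ojaswi', 'ojaswi', 'vvit', 'rohith', 'rohith', 'vvit', 'sridevi', 'sridevi', 'vignan', 'sravan', 'sravan', 'vignan', 'gnanesh', 'gnanesh', 'iit']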
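To make the flattening behaviour easier to see without a Spark session, here is a minimal plain-Python analogy. It is only a sketch: the tuples stand in for the Row objects that select() produces, and flat_map is a hypothetical helper written for this illustration, not part of PySpark.

```python
from itertools import chain

# Stand-in rows: each tuple plays the role of one Row returned by select().
single_column = [("sravan",), ("ojaswi",), ("rohith",)]
two_columns = [("sravan", "vignan"), ("ojaswi", "vvit"), ("rohith", "vvit")]


def flat_map(func, rows):
    """Apply func to each row and concatenate the resulting iterables."""
    return list(chain.from_iterable(func(row) for row in rows))


# One selected column: flattening the one-element rows gives a plain list.
names = flat_map(lambda x: x, single_column)

# Several selected columns: the values are interleaved row by row.
mixed = flat_map(lambda x: x, two_columns)

print(names)  # ['sravan', 'ojaswi', 'rohith']
print(mixed)  # ['sravan', 'vignan', 'ojaswi', 'vvit', 'rohith', 'vvit']
```

With a single selected column the result is the column itself; with several columns it is one flat list rather than one list per column.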

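Method 2: Using map()

map() applies a function to every row of the RDD. Here the function returns the first field of each row (x[0]), so collect() yields the values of the selected column as a list.

Example: Python code to convert PySpark DataFrame columns to lists using map().

Python3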

# convert the student NAME column to a list using map
print(dataframe.select('student NAME').
      rdd.map(lambda x: x[0]).collect())

# convert the student ID column to a list using map
print(dataframe.select('student ID').
      rdd.map(lambda x: x[0]).collect())

# convert the college column to a list using map
print(dataframe.select('college').
      rdd.map(lambda x: x[0]).collect())

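Output:

['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
['1', '2', '3', '4', '1', '5']
['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']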
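The difference between map and flatMap can be sketched in plain Python. The snippet below is an analogy under stated assumptions, not Spark code: tuples stand in for Row objects, and itertools.chain imitates the flattening step. It also shows why combining flattening with x[0] on a string column is usually a mistake, since a string is itself iterable.

```python
from itertools import chain

# Stand-in rows for a two-column selection (hypothetical sample data).
rows = [("sravan", "vignan"), ("ojaswi", "vvit")]

# map: exactly one output element per row (here, the first field).
mapped = list(map(lambda x: x[0], rows))

# Flattening the results of the same function: each string is iterable,
# so it is split into individual characters.
flattened = list(chain.from_iterable(map(lambda x: x[0], rows)))

print(mapped)         # ['sravan', 'ojaswi']
print(flattened[:6])  # ['s', 'r', 'a', 'v', 'a', 'n']
```

In short, use map with x[0] (or another index) to pick one field per row, and flatMap with the identity function to flatten whole rows.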



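Method 3: Using collect()

collect() returns all rows of the DataFrame to the driver as a list of Row objects. A list comprehension then extracts the value of the selected column from each row.

Example: Python code to convert DataFrame columns to lists using the collect() method

Python3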

# display the college column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('college').collect()])

# display the student ID column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('student ID').collect()])

# display the subject1 column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('subject1').collect()])

# display the subject2 column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('subject2').collect()])

Output:

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['1', '2', '3', '4', '1', '5']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]

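Method 4: Using toLocalIterator()

toLocalIterator() returns an iterator over the rows of the DataFrame instead of a fully materialized list. A list comprehension over that iterator extracts the value of the selected column from each row.

Example: Converting PySpark DataFrame columns to lists using the toLocalIterator() method

Python3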

# display the college column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('college').toLocalIterator()])

# display the student ID column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('student ID').toLocalIterator()])

# display the subject1 column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('subject1').toLocalIterator()])

# display the subject2 column as a list
# using a list comprehension
print([row[0] for row in dataframe.
       select('subject2').toLocalIterator()])

Output:

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['1', '2', '3', '4', '1', '5']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]
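The practical difference from collect() is that an iterator can be consumed lazily, one partition at a time. The following plain-Python sketch imitates that idea with a generator. The partition layout and the local_iterator helper are assumptions made for illustration; Spark's actual fetching behaviour may differ in detail.

```python
from itertools import islice

# Stand-in for a column split across three partitions (hypothetical layout).
partitions = [[67, 78], [100, 78], [89, 94]]
fetched = []  # records which partitions were actually read


def local_iterator(parts):
    """Yield values one partition at a time, loosely like toLocalIterator()."""
    for index, part in enumerate(parts):
        fetched.append(index)
        yield from part


# Taking only the first three values touches two of the three partitions.
first_three = list(islice(local_iterator(partitions), 3))
print(first_three)  # [67, 78, 100]
print(fetched)      # [0, 1]

# A full list comprehension consumes every partition.
fetched.clear()
all_values = [value for value in local_iterator(partitions)]
print(all_values)   # [67, 78, 100, 78, 89, 94]
```

When the whole column is turned into a list anyway, as in the example above, the end result matches collect(); the iterator mainly helps when rows are processed incrementally.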

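Method 5: Using toPandas()

toPandas() converts the selected column to a pandas DataFrame; the resulting pandas column can then be converted to a Python list with list().

Example: Converting PySpark DataFrame columns to lists using the toPandas() method

Python3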



# display college  column in
# the list format using toPandas
print(list(dataframe.select('college').
           toPandas()['college']))
  
  
# display student NAME  column in
# the list format using toPandas
print(list(dataframe.select('student NAME').
           toPandas()['student NAME']))
  
# display subject1  column in
# the list format using toPandas
print(list(dataframe.select('subject1').
           toPandas()['subject1']))
  
# display subject2  column
# in the list format using toPandas
print(list(dataframe.select('subject2').
           toPandas()['subject2']))

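Output:

['vignan', 'vvit', 'vvit', 'vignan', 'vignan', 'iit']
['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
[67, 78, 100, 78, 89, 94]
[89, 89, 80, 80, 98, 98]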