How to Get a Value from a Row Object in a PySpark DataFrame?
In this article, we will learn how to get values from a Row object in a PySpark DataFrame.
Method 1: Using the __getitem__() magic method
We will create a Spark DataFrame containing at least one row using createDataFrame(). We then get a Row object from the list of Row objects returned by DataFrame.collect(), and use the __getitem__() magic method to fetch the item for a particular column name. The syntax is given below.
Syntax: Row.__getitem__('Column_Name')
Returns: the value corresponding to the column name in the Row object
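A Row can also be constructed by hand, so this lookup can be tried without starting a Spark session at all. Below is a minimal sketch; the tournament values are just sample data.
Python
# Rows can be built directly with keyword arguments;
# no SparkSession is needed for this
from pyspark.sql import Row

row = Row(Tournament='All England Open', Month='March', Level='Super 1000')

# __getitem__() returns the value for the given column name
print(row.__getitem__('Level'))  # Super 1000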
Python
# library imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Session creation
random_value_session = SparkSession.builder.appName(
    'Random_Value_Session'
).getOrCreate()

# Data filled in our DataFrame, 5 rows below
rows = [['All England Open', 'March', 'Super 1000'],
        ['Malaysia Open', 'January', 'Super 750'],
        ['Korea Open', 'April', 'Super 500'],
        ['Hylo Open', 'November', 'Super 100'],
        ['Spain Masters', 'March', 'Super 300']]

# Columns of our DataFrame
columns = ['Tournament', 'Month', 'Level']

# DataFrame is created
dataframe = random_value_session.createDataFrame(rows, columns)

# Showing the DataFrame
dataframe.show()

# Getting the list of Row objects using collect()
row_list = dataframe.collect()

# Printing the first Row object
# from which data is extracted
print(row_list[0])

# Using the __getitem__() magic method to get
# the value corresponding to a particular column name
print(row_list[0].__getitem__('Level'))
print(row_list[0].__getitem__('Tournament'))
print(row_list[0].__getitem__('Level'))
print(row_list[0].__getitem__('Month'))
Output:
+----------------+--------+----------+
| Tournament| Month| Level|
+----------------+--------+----------+
|All England Open| March|Super 1000|
| Malaysia Open| January| Super 750|
| Korea Open| April| Super 500|
| Hylo Open|November| Super 100|
| Spain Masters| March| Super 300|
+----------------+--------+----------+
Row(Tournament='All England Open', Month='March', Level='Super 1000')
Super 1000
All England Open
Super 1000
March
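Calling the magic method explicitly is rarely necessary in practice: row['Column_Name'] and row.Column_Name go through the same lookup, so either form works as a shorthand. A short sketch on a hand-built Row:
Python
from pyspark.sql import Row

row = Row(Tournament='All England Open', Month='March', Level='Super 1000')

# All three lookups return the same value
print(row.__getitem__('Level'))  # Super 1000
print(row['Level'])              # Super 1000
print(row.Level)                 # Super 1000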
Method 2: Using the asDict() method
We will create a Spark DataFrame containing at least one row using createDataFrame(). We then get a Row object from the list of Row objects returned by DataFrame.collect(), and use the asDict() method to get a dictionary where the column names are the keys and the corresponding row values are the dictionary values. The syntax is given below:
Syntax: Row.asDict(recursive)
Parameters:
recursive (bool): returns nested Rows as dicts as well. The default value is False.
We can then easily fetch values from the dictionary using DictionaryName['key_name'].
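The conversion itself can also be tried on a hand-built Row, with no Spark session involved. A minimal sketch:
Python
from pyspark.sql import Row

row = Row(Tournament='Macau Open', Month='November', Level='Super 300')

# asDict() returns a plain Python dictionary
# keyed by the column names
d = row.asDict()
print(d['Month'])  # November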
Python
# library imports are done here
import pyspark
from pyspark.sql import SparkSession

# Session creation
random_value_session = SparkSession.builder.appName(
    'Random_Value_Session'
).getOrCreate()

# Data filled in our DataFrame
rows = [['French Open', 'October', 'Super 750'],
        ['Macau Open', 'November', 'Super 300'],
        ['India Open', 'January', 'Super 500'],
        ['Odisha Open', 'January', 'Super 100'],
        ['China Open', 'November', 'Super 1000']]

# DataFrame columns
columns = ['Tournament', 'Month', 'Level']

# DataFrame creation
dataframe = random_value_session.createDataFrame(rows, columns)

# Showing the DataFrame
dataframe.show()

# Getting the list of Row objects using collect()
row_list = dataframe.collect()

# Printing the second Row object
# from which we will read data
print(row_list[1])
print()

# Printing the dictionary to make
# things clearer
print(row_list[1].asDict())
print()

# Using the asDict() method to convert the Row object
# into a dictionary where the column names are keys,
# then using the column names as keys to get the values
print(row_list[1].asDict()['Tournament'])
print(row_list[1].asDict()['Month'])
print(row_list[1].asDict()['Level'])
Output:
+-----------+--------+----------+
| Tournament| Month| Level|
+-----------+--------+----------+
|French Open| October| Super 750|
| Macau Open|November| Super 300|
| India Open| January| Super 500|
|Odisha Open| January| Super 100|
| China Open|November|Super 1000|
+-----------+--------+----------+
Row(Tournament='Macau Open', Month='November', Level='Super 300')
{'Tournament': 'Macau Open', 'Month': 'November', 'Level': 'Super 300'}
Macau Open
November
Super 300
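The recursive parameter only matters when a Row contains nested Rows. A short sketch of the difference, using a made-up nested structure (and assuming PySpark 3.x, where keyword argument order is preserved in Rows):
Python
from pyspark.sql import Row

outer = Row(name='French Open',
            details=Row(month='October', level='Super 750'))

# By default, nested Rows stay as Row objects
print(outer.asDict())
# {'name': 'French Open', 'details': Row(month='October', level='Super 750')}

# With recursive=True, nested Rows become dictionaries too
print(outer.asDict(True))
# {'name': 'French Open', 'details': {'month': 'October', 'level': 'Super 750'}}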
Method 3: Imagining the Row object as a list
Here, we will treat a Row object as if it were a Python list and index into it. We will create a Spark DataFrame containing at least one row using createDataFrame(), and then get a Row object from the list of Row objects returned by DataFrame.collect(). Since we are imagining the Row object as a list, we simply use:
Syntax: RowObject[index]
Returns: the value at the given positional index in the Row object.
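As before, positional access can be tried on a hand-built Row without a Spark session. A minimal sketch (again assuming PySpark 3.x keyword ordering):
Python
from pyspark.sql import Row

row = Row(Tournament='Korea Open', Month='April', Level='Super 500')

# Positional access, just like a list or tuple
print(row[0])  # Korea Open
print(row[2])  # Super 500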
Python
# library imports are done here
import pyspark
from pyspark.sql import SparkSession

# Session creation
random_value_session = SparkSession.builder.appName(
    'Random_Value_Session'
).getOrCreate()

# Data filled in our DataFrame
rows = [['Denmark Open', 'October', 'Super 1000'],
        ['Indonesia Open', 'June', 'Super 1000'],
        ['Korea Open', 'April', 'Super 500'],
        ['Japan Open', 'August', 'Super 750'],
        ['Akita Masters', 'July', 'Super 100']]

# DataFrame columns
columns = ['Tournament', 'Month', 'Level']

# DataFrame creation
dataframe = random_value_session.createDataFrame(rows, columns)

# Showing the DataFrame
dataframe.show()

# Getting the list of Row objects using collect()
row_list = dataframe.collect()

# Let's take the third Row object
row_object = row_list[2]

# If we imagine it as a Python list,
# we can get the first value of the list,
# index 0; let's try it
print(row_object[0])
# We got the value of the column at index 0,
# which is 'Tournament'

# A few more examples
print(row_list[4][0])
print(row_list[3][1])
print(row_list[4][2])
Output:
+--------------+-------+----------+
| Tournament| Month| Level|
+--------------+-------+----------+
| Denmark Open|October|Super 1000|
|Indonesia Open| June|Super 1000|
| Korea Open| April| Super 500|
| Japan Open| August| Super 750|
| Akita Masters| July| Super 100|
+--------------+-------+----------+
Korea Open
Akita Masters
August
Super 100
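Finally, Row is implemented as a subclass of Python's tuple, so the usual tuple operations also behave as expected: negative indices count from the end, and slices return plain tuples. A short sketch on a hand-built Row (again assuming PySpark 3.x keyword ordering):
Python
from pyspark.sql import Row

row = Row(Tournament='Akita Masters', Month='July', Level='Super 100')

# Negative indexing works as it does for tuples
print(row[-1])   # Super 100

# Slicing returns a plain tuple, not a Row
print(row[0:2])  # ('Akita Masters', 'July')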