PySpark – Create a dictionary from data in two columns
In this article, we will see how to use Python to create a dictionary from the data in two columns of a PySpark DataFrame.
Method 1: Using dictionary comprehension
Here, we will create a DataFrame with two columns and then convert it into a dictionary using dictionary comprehension.
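Dictionary comprehension itself is plain Python, so before the full PySpark example it may help to see the pattern on hard-coded pairs standing in for collected rows (a minimal sketch, no Spark session needed):

```python
# Hard-coded (name, age) pairs standing in
# for the rows returned by collect()
pairs = [('John', 54), ('Adam', 65), ('Michael', 56)]

# Build a {name: age} dictionary with a
# dictionary comprehension
ages = {name: age for name, age in pairs}
print(ages)  # {'John': 54, 'Adam': 65, 'Michael': 56}
```

The full example below applies the same comprehension to the `Row` objects returned by `collect()`.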
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data
rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49],
        ]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# Dictionary comprehension is used here:
# the Name column is the key while the
# Age column is the value.
# You can also use {row['Age']: row['Name']
# for row in df_pyspark.collect()}
# to reverse the key/value pairs.
# collect() gives a list of the
# rows in the DataFrame
result_dict = {row['Name']: row['Age']
               for row in df_pyspark.collect()}

# Printing a few key:value pairs of
# our final resultant dictionary
print(result_dict['John'])
print(result_dict['Michael'])
print(result_dict['Adam'])
Output:
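The code comment above mentions that the key/value pairs can be reversed to map ages to names. Keep in mind that dictionary keys must be unique, so duplicate ages (56 and 49 each appear twice in the sample data) silently keep only the last name seen. A minimal sketch, with plain tuples standing in for the collected rows:

```python
# Only the rows with duplicated ages from the sample data
rows = [('Michael', 56), ('Chris', 49),
        ('Joseph', 56), ('Richard', 49)]

# Reversing the pairs: for a repeated age, the later
# name overwrites the earlier one
by_age = {age: name for name, age in rows}
print(by_age)  # {56: 'Joseph', 49: 'Richard'} - Michael and Chris are lost
```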
Method 2: Converting the PySpark DataFrame and using the to_dict() method
Here are the details of the to_dict() method:

to_dict(): PandasDataFrame.to_dict(orient='dict')
Parameters:
- orient: str {'dict', 'list', 'series', 'split', 'records', 'index'}
  Determines the type of the values of the dictionary.
Return: a Python dictionary corresponding to the DataFrame.
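The effect of the orient parameter can be seen on a small pandas DataFrame with the same two columns (pandas only, no Spark session needed for this sketch):

```python
import pandas as pd

# A small pandas DataFrame with the same two columns
df = pd.DataFrame({'Name': ['John', 'Adam'], 'Age': [54, 65]})

# orient='list': one key per column, each value
# is the list of that column's values
col_dict = df.to_dict(orient='list')
print(col_dict)  # {'Name': ['John', 'Adam'], 'Age': [54, 65]}

# orient='records': one dictionary per row instead
row_dicts = df.to_dict(orient='records')
print(row_dicts)
```

orient='list' is the variant used in the example below, since it yields exactly one entry per column.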
Python
# importing pyspark
# make sure you have installed
# the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data
rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49],
        ]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Convert the DataFrame into a
# dictionary
result = df_pandas.to_dict(orient='list')

# Print the dictionary
print(result)
Output:
Method 3: Iterating over the columns
Traverse the columns and build a dictionary such that each key is a column name and each value is the list of values in that column.
For this, we first need to convert the PySpark DataFrame into a pandas DataFrame.
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession to work on
# DataFrames. The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data
rows = [['John', 54],
        ['Adam', 65],
        ['Michael', 56],
        ['Kelsey', 37],
        ['Chris', 49],
        ['Jonathan', 28],
        ['Anthony', 26],
        ['Esther', 48],
        ['Rachel', 52],
        ['Joseph', 56],
        ['Richard', 49],
        ]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

result = {}

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Traverse through each column
for column in df_pandas.columns:
    # Add the column name as the key and
    # the list of column values as the value
    result[column] = df_pandas[column].values.tolist()

# Print the dictionary
print(result)
Output:
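If the goal is the Method 1 style {Name: Age} mapping but the DataFrame has already been converted to pandas, pairing the two columns with zip() is a common shortcut. A sketch on a hand-built pandas frame standing in for df_pyspark.toPandas():

```python
import pandas as pd

# A hand-built pandas frame standing in for
# the result of df_pyspark.toPandas()
df_pandas = pd.DataFrame({'Name': ['John', 'Adam', 'Michael'],
                          'Age': [54, 65, 56]})

# Pair the two columns up and build the mapping directly;
# .tolist() turns the Age values into plain Python ints
name_to_age = dict(zip(df_pandas['Name'], df_pandas['Age'].tolist()))
print(name_to_age)  # {'John': 54, 'Adam': 65, 'Michael': 56}
```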