
Converting a PySpark DataFrame to a Dictionary in Python

In this article, we will see how to convert a PySpark DataFrame into a Python dictionary in which the keys are the column names and the values are the lists of column values.

Before getting started, we will create a sample DataFrame:

Python3
# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()
  
# Data for the dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df = spark.createDataFrame(data=data,
                           schema=columns)
  
# Print the dataframe
df.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 1: Using df.toPandas()

Convert the PySpark DataFrame to a pandas DataFrame using df.toPandas().

Then traverse each column and add the list of its values to a dictionary, with the column name as the key.

Python3

# Declare an empty dictionary
# (named df_dict so it does not shadow the dict builtin)
df_dict = {}

# Convert the PySpark DataFrame to a pandas
# DataFrame, keeping df intact so the later
# methods can still use the Spark DataFrame
pandas_df = df.toPandas()

# Traverse each column
for column in pandas_df.columns:

    # Add the column name as the key and the
    # list of column values as the value
    df_dict[column] = pandas_df[column].values.tolist()

# Print the dictionary
print(df_dict)

Output:

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'], 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'], 'Gender': ['M', 'M', 'M', 'F', 'F'], 'salary': [3000, 4000, 4000, 4000, 1200]}
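
For reference, the same column-wise conversion can be written more compactly as a dictionary comprehension. This is a minimal sketch, assuming the sample DataFrame df created in the setup code above; pandas_df is just an illustrative variable name.

Python3

# A compact variant of Method 1: build the dictionary
# with a comprehension over the pandas columns
pandas_df = df.toPandas()
df_dict = {column: pandas_df[column].tolist()
           for column in pandas_df.columns}
print(df_dict)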

Method 2: Using df.collect()

df.collect() returns all the records of the DataFrame as a list of Row objects. Convert each Row to a plain list, stack the rows into a NumPy array, and then slice the array column by column. Note that NumPy casts the mixed-type rows to a common string dtype, so the numeric values end up as strings in the resulting dictionary.

Python3

import numpy as np

# Convert the dataframe into a list
# of rows
rows = [list(row) for row in df.collect()]

# Convert the list into a numpy array
# (this casts every value to a string)
ar = np.array(rows)

# Declare an empty dictionary
df_dict = {}

# Go through each column
for i, column in enumerate(df.columns):

    # Add the ith column's values to the dict
    # with the ith column name as the key
    df_dict[column] = list(ar[:, i])

# Print the dictionary
print(df_dict)

Output:
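
If you want to keep the original Python types (as noted above, the NumPy array casts every value to a string), the collected rows can instead be transposed with the built-in zip(). A minimal sketch, assuming the sample DataFrame df from the setup code; PySpark Row objects behave like tuples, so zip(*rows) pairs up the values column by column.

Python3

# Transpose the collected rows with zip() so each
# column's values keep their original Python types
rows = df.collect()
df_dict = {column: list(values)
           for column, values in zip(df.columns, zip(*rows))}
print(df_dict)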

Method 3: Using pandas.DataFrame.to_dict()

A pandas DataFrame can be converted to a dictionary directly with the to_dict() method; with orient='list', each column name is mapped to the list of that column's values.

Code:

Python3

# Convert the PySpark dataframe to a
# pandas dataframe
pandas_df = df.toPandas()

# Convert the dataframe into a
# dictionary of column-value lists
df_dict = pandas_df.to_dict(orient='list')

# Print the dictionary
print(df_dict)

Output:

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'], 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'], 'Gender': ['M', 'M', 'M', 'F', 'F'], 'salary': [3000, 4000, 4000, 4000, 1200]}
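
to_dict() also supports other orientations. For example, orient='records' returns one dictionary per row instead of one list per column. A short sketch, assuming the pandas_df conversion from the code above:

Python3

# One dictionary per row instead of one list per column
row_dicts = pandas_df.to_dict(orient='records')
print(row_dicts)
# e.g. [{'Name': 'Ram', 'DOB': '1991-04-01',
#        'Gender': 'M', 'salary': 3000}, ...]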

To convert a DataFrame with two columns into a dictionary, create a DataFrame with two columns named 'Location' and 'House_price':

Python3

# Importing necessary libraries
from pyspark.sql import SparkSession

# Create a spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()

# Data for the dataframe
data = [('Hyderabad', 120000),
        ('Delhi', 124000),
        ('Mumbai', 344000),
        ('Guntur', 454000),
        ('Bandra', 111200)]

# Column names in dataframe
columns = ['Location', 'House_price']

# Create the spark dataframe
df = spark.createDataFrame(data=data, schema=columns)

# Print the dataframe
print('Dataframe : ')
df.show()

# Convert the PySpark dataframe to a
# pandas dataframe
pandas_df = df.toPandas()

# Convert the dataframe into a
# dictionary of column-value lists
df_dict = pandas_df.to_dict(orient='list')

# Print the dictionary
print('Dictionary :')
print(df_dict)

Output:

Dataframe : 
+---------+-----------+
| Location|House_price|
+---------+-----------+
|Hyderabad|     120000|
|    Delhi|     124000|
|   Mumbai|     344000|
|   Guntur|     454000|
|   Bandra|     111200|
+---------+-----------+

Dictionary :
{'Location': ['Hyderabad', 'Delhi', 'Mumbai', 'Guntur', 'Bandra'], 'House_price': [120000, 124000, 344000, 454000, 111200]}
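
For a two-column DataFrame like this one, a common variant is to map each value of the first column directly to the corresponding value of the second, rather than keeping two parallel lists. A minimal sketch, assuming the pandas_df produced in the code above; price_by_location is just an illustrative name.

Python3

# Map each Location to its House_price
price_by_location = dict(zip(pandas_df['Location'],
                             pandas_df['House_price'].tolist()))
print(price_by_location)
# e.g. {'Hyderabad': 120000, 'Delhi': 124000, ...}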