How to Convert Pandas to PySpark DataFrame?
In this article, we will learn how to convert a Pandas DataFrame to a PySpark DataFrame. Sometimes we get data in formats such as CSV or XLSX that we need to store in a PySpark DataFrame; we can do this by loading the data into Pandas and then converting it to a PySpark DataFrame. To perform the conversion, we pass the Pandas DataFrame to the createDataFrame() method.
Syntax: spark.createDataFrame(data, schema)
Parameters:
- data – the values (here, a Pandas DataFrame) from which the DataFrame is created.
- schema – the structure of the dataset, or a list of column names.
where spark is the SparkSession object.
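The schema argument is optional: when it is omitted, as in the examples below, Spark infers the column names and types from the Pandas DataFrame. As a minimal sketch of passing an explicit schema (the renamed columns here are purely illustrative):
Python3
# import pandas and SparkSession
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    "pandas to spark").getOrCreate()

# a small illustrative Pandas DataFrame
pdf = pd.DataFrame({'State': ['Alaska'], 'city': ['Anchorage']})

# passing a list of column names as the schema
# renames the columns in the resulting DataFrame
df = spark.createDataFrame(pdf, schema=['state_name', 'city_name'])
df.show()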
Example 1: Create a DataFrame and then convert using the spark.createDataFrame() method
Python3
# import the pandas library
import pandas as pd

# from the pyspark library import SparkSession
from pyspark.sql import SparkSession

# build the SparkSession and name it 'pandas to spark'
spark = SparkSession.builder.appName(
    "pandas to spark").getOrCreate()

# create the Pandas DataFrame with pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})

# convert the Pandas DataFrame to a PySpark DataFrame
df_spark = spark.createDataFrame(data)
df_spark.show()
Output:
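To check how Spark inferred the column types from the Pandas DataFrame, you can print the schema of the converted DataFrame; a quick sketch using the df_spark variable from above:
Python3
# inspect the schema Spark inferred during the conversion
df_spark.printSchema()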
Example 2: Create a DataFrame and then convert using the spark.createDataFrame() method with Apache Arrow
In this method, we use Apache Arrow to convert the Pandas DataFrame to a PySpark DataFrame.
Python3
# import the pandas library
import pandas as pd

# from the pyspark library import SparkSession
from pyspark.sql import SparkSession

# build the SparkSession and name it 'pandas to spark'
spark = SparkSession.builder.appName(
    "pandas to spark").getOrCreate()

# create the Pandas DataFrame with pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})

# enabling Apache Arrow for converting Pandas to a
# PySpark DataFrame
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# creating the PySpark DataFrame
spark_arrow = spark.createDataFrame(data)

# show the DataFrame
spark_arrow.show()
Output:
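Note that spark.sql.execution.arrow.enabled is the pre-3.0 name of this setting; it still works on newer versions but is deprecated. A minimal sketch of the Spark 3.x equivalents, including the fallback switch that lets Spark retry without Arrow if the optimized conversion cannot be applied:
Python3
# Spark 3.x name of the Arrow toggle
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# retry with the non-Arrow path if Arrow conversion fails
spark.conf.set(
    "spark.sql.execution.arrow.pyspark.fallback.enabled", "true")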
Example 3: Load a DataFrame from a CSV file and then convert
In this method, we read the CSV file into both a Pandas DataFrame and a PySpark DataFrame. The dataset used here is heart.csv.
Python3
# import the pandas library
import pandas as pd

# read the dataset into a Pandas DataFrame
df_pd = pd.read_csv('heart.csv')

# show the dataset; head() returns the top 5 rows
df_pd.head()
Output:
Python3
# reading the CSV file into a PySpark DataFrame
df_spark2 = spark.read.option(
    'header', 'true').csv("heart.csv")

# showing the data in the form of a table,
# limited to the top 5 rows
df_spark2.show(5)
Output:
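By default, spark.read.csv() reads every column as a string. If you also want Spark to detect column types while reading, a small sketch with the inferSchema option (this costs one extra pass over the file):
Python3
# read the CSV again, letting Spark infer column types
df_typed = spark.read.option('header', 'true') \
    .option('inferSchema', 'true').csv("heart.csv")

# verify the inferred types
df_typed.printSchema()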
We can also convert a PySpark DataFrame back to a Pandas DataFrame. For this, we will use the DataFrame.toPandas() method.
Syntax: DataFrame.toPandas()
Returns the contents of this DataFrame as a pandas.DataFrame.
Python3
# convert the PySpark DataFrame to a Pandas DataFrame
# with toPandas(); head() then shows only the top 5
# rows of the dataset
df_spark2.toPandas().head()
Output:
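Keep in mind that toPandas() collects the entire DataFrame into the driver's memory, so it is best reserved for small results. A minimal sketch of bounding the amount of data first (the row count here is arbitrary):
Python3
# take a bounded number of rows before collecting to
# the driver, to avoid exhausting driver memory
small_pdf = df_spark2.limit(1000).toPandas()
print(small_pdf.shape)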