How to Convert Pandas to PySpark DataFrame?
In this article, we will learn how to convert a Pandas DataFrame to a PySpark DataFrame. Sometimes we get data in formats such as CSV or XLSX that we need to store in a PySpark DataFrame; we can do this by loading the data into Pandas and then converting it to a PySpark DataFrame. To perform the conversion, we pass the Pandas DataFrame to the createDataFrame() method.
Syntax: spark.createDataFrame(data, schema)
Parameters:
- data – the values (here, a Pandas DataFrame) from which the DataFrame is created.
- schema – the structure of the dataset, or a list of column names.
where spark is the SparkSession object.
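The schema argument is optional: when it is omitted, as in the examples below, Spark infers the column names and types from the Pandas DataFrame. As a minimal sketch of passing an explicit schema (the renamed columns here are purely illustrative):
Python3
# import pandas and SparkSession
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    "pandas to spark").getOrCreate()

# a small illustrative Pandas DataFrame
pdf = pd.DataFrame({'State': ['Alaska'], 'city': ['Anchorage']})

# passing a list of column names as the schema
# renames the columns in the resulting DataFrame
df = spark.createDataFrame(pdf, schema=['state_name', 'city_name'])
df.show()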
Example 1: Create a DataFrame and then convert using the spark.createDataFrame() method
Python3
# import the pandas library
import pandas as pd

# from the pyspark library import SparkSession
from pyspark.sql import SparkSession

# build the SparkSession and name it 'pandas to spark'
spark = SparkSession.builder.appName(
    "pandas to spark").getOrCreate()

# create the Pandas DataFrame with pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})

# convert the Pandas DataFrame to a PySpark DataFrame
df_spark = spark.createDataFrame(data)
df_spark.show()
Output:
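To check how Spark inferred the column types from the Pandas DataFrame, you can print the schema of the converted DataFrame; a quick sketch using the df_spark variable from above:
Python3
# inspect the schema Spark inferred during the conversion
df_spark.printSchema()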
Example 2: Create a DataFrame and then convert using the spark.createDataFrame() method with Apache Arrow
In this method, we use Apache Arrow to convert the Pandas DataFrame to a PySpark DataFrame.
Python3
# import the pandas library
import pandas as pd

# from the pyspark library import SparkSession
from pyspark.sql import SparkSession

# build the SparkSession and name it 'pandas to spark'
spark = SparkSession.builder.appName(
    "pandas to spark").getOrCreate()

# create the Pandas DataFrame with pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})

# enabling Apache Arrow for converting Pandas to a
# PySpark DataFrame
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# creating the PySpark DataFrame
spark_arrow = spark.createDataFrame(data)

# show the DataFrame
spark_arrow.show()
Output:
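Note that spark.sql.execution.arrow.enabled is the pre-3.0 name of this setting; it still works on newer versions but is deprecated. A minimal sketch of the Spark 3.x equivalents, including the fallback switch that lets Spark retry without Arrow if the optimized conversion cannot be applied:
Python3
# Spark 3.x name of the Arrow toggle
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# retry with the non-Arrow path if Arrow conversion fails
spark.conf.set(
    "spark.sql.execution.arrow.pyspark.fallback.enabled", "true")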
Example 3: Load a DataFrame from a CSV file and then convert
In this method, we read the CSV file into both a Pandas DataFrame and a PySpark DataFrame. The dataset used here is heart.csv.
Python3
# import the pandas library
import pandas as pd

# read the dataset into a Pandas DataFrame
df_pd = pd.read_csv('heart.csv')

# show the dataset; head() returns the top 5 rows
df_pd.head()
Output:
Python3
# reading the CSV file into a PySpark DataFrame
df_spark2 = spark.read.option(
    'header', 'true').csv("heart.csv")

# showing the data in the form of a table,
# limited to the top 5 rows
df_spark2.show(5)
Output:
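By default, spark.read.csv() reads every column as a string. If you also want Spark to detect column types while reading, a small sketch with the inferSchema option (this costs one extra pass over the file):
Python3
# read the CSV again, letting Spark infer column types
df_typed = spark.read.option('header', 'true') \
    .option('inferSchema', 'true').csv("heart.csv")

# verify the inferred types
df_typed.printSchema()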
We can also convert a PySpark DataFrame back to a Pandas DataFrame. For this, we will use the DataFrame.toPandas() method.
Syntax: DataFrame.toPandas()
Returns the contents of this DataFrame as a pandas.DataFrame.
Python3
# convert the PySpark DataFrame to a Pandas DataFrame
# with toPandas(); head() then shows only the top 5
# rows of the dataset
df_spark2.toPandas().head()
Output:
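Keep in mind that toPandas() collects the entire DataFrame into the driver's memory, so it is best reserved for small results. A minimal sketch of bounding the amount of data first (the row count here is arbitrary):
Python3
# take a bounded number of rows before collecting to
# the driver, to avoid exhausting driver memory
small_pdf = df_spark2.limit(1000).toPandas()
print(small_pdf.shape)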