
📅  Last modified: 2022-05-13 01:54:42.443000             🧑  Author: Mango

How to Convert Pandas to PySpark DataFrame?

In this article, we will learn how to convert a Pandas DataFrame to a PySpark DataFrame. Sometimes we receive data in formats such as CSV or XLSX and need to store it in a PySpark DataFrame; this can be done by loading the data into Pandas first and then converting it. To perform the conversion, we pass the Pandas DataFrame to the createDataFrame() method.
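The examples below use CSV, but the same pattern works for any format Pandas can read. As a minimal sketch for Excel input (assuming a hypothetical file named data.xlsx and an Excel engine such as openpyxl installed):

Python3
# import the pandas library
import pandas as pd

# from pyspark library import SparkSession
from pyspark.sql import SparkSession

# Build the SparkSession
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()

# Load the Excel sheet into a Pandas DataFrame
# (data.xlsx is a hypothetical file name)
pdf = pd.read_excel('data.xlsx')

# Convert the Pandas DataFrame to PySpark
sdf = spark.createDataFrame(pdf)
sdf.show()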

Example 1: Create a DataFrame and then convert it using the spark.createDataFrame() method

Python3
# import the pandas
import pandas as pd
  
# from  pyspark library import 
# SparkSession
from pyspark.sql import SparkSession
  
# Build the SparkSession and name
# it 'pandas to spark'
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()
  
# Create the DataFrame with the help 
# of pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                       
                     'city': ["Anchorage", "Los Angeles", 
                              "Miami", "Bellevue"]})
  
# create DataFrame
df_spark = spark.createDataFrame(data)
  
df_spark.show()


Output:
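createDataFrame() infers the column types from the Pandas dtypes. As a quick sanity check (a small sketch reusing the data and df_spark variables from Example 1; the DDL schema string shown is optional), you can print the inferred schema or supply the column types explicitly:

Python3
# Inspect the schema Spark inferred from
# the Pandas dtypes
df_spark.printSchema()

# Optionally supply an explicit DDL schema
# string instead of relying on inference
df_explicit = spark.createDataFrame(
  data, schema="State string, city string")
df_explicit.printSchema()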



Example 2: Create a DataFrame and then convert it using the spark.createDataFrame() method with Apache Arrow enabled

In this method, we use Apache Arrow, an in-memory columnar data format, to make the conversion from Pandas to a PySpark DataFrame more efficient.

Python3

# import the pandas
import pandas as pd
  
# from  pyspark library import 
# SparkSession
from pyspark.sql import SparkSession
  
# Build the SparkSession and name
# it 'pandas to spark'
spark = SparkSession.builder.appName(
  "pandas to spark").getOrCreate()
  
# Create the DataFrame with the help 
# of pd.DataFrame()
data = pd.DataFrame({'State': ['Alaska', 'California',
                               'Florida', 'Washington'],
                       
                     'city': ["Anchorage", "Los Angeles",
                              "Miami", "Bellevue"]})
  
  
# enabling Apache Arrow to speed up the
# Pandas to PySpark DataFrame conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
  
# Creating the DataFrame
spark_arrow = spark.createDataFrame(data)
  
# Show the DataFrame
spark_arrow.show()

Output:
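Note that spark.sql.execution.arrow.enabled is the legacy configuration key. On Spark 3.0 and later the Arrow setting was renamed, so a version-appropriate sketch would be:

Python3
# Spark 3.0+ key that replaces the deprecated
# spark.sql.execution.arrow.enabled
spark.conf.set(
  "spark.sql.execution.arrow.pyspark.enabled", "true")

# Optional: fall back to the non-Arrow path
# automatically if the Arrow conversion fails
spark.conf.set(
  "spark.sql.execution.arrow.pyspark.fallback.enabled", "true")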

Example 3: Load a DataFrame from a CSV file and then convert it

With this approach, we can easily read a CSV file into both a Pandas DataFrame and a PySpark DataFrame. The dataset used here is heart.csv.

Python3

# import the pandas library
import pandas as pd
  
# Read the Dataset in Pandas Dataframe
df_pd = pd.read_csv('heart.csv')
  
# Show the dataset here head() 
# will return top 5 rows
df_pd.head()

Output:
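Before converting, it can be useful to check what Pandas loaded, since the dtypes determine the schema that spark.createDataFrame() will infer (a quick optional check on the same df_pd):

Python3
# Number of rows and columns read by Pandas
print(df_pd.shape)

# Column dtypes; these drive the column types
# that spark.createDataFrame() infers
print(df_pd.dtypes)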

Python3

# Reading the csv file in 
# Pyspark DataFrame
df_spark2 = spark.read.option(
  'header', 'true').csv("heart.csv")
  
# Showing the data in the form of a
# table, displaying only the top 5 rows
df_spark2.show(5)

Output:
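The same read can be written more compactly with keyword arguments; inferSchema asks Spark to detect the column types instead of treating every column as a string (a sketch producing a hypothetical df_spark3):

Python3
# Equivalent read using keyword arguments;
# inferSchema detects numeric columns
df_spark3 = spark.read.csv(
  "heart.csv", header=True, inferSchema=True)

df_spark3.show(5)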

We can also convert a PySpark DataFrame back to a Pandas DataFrame. For this, we will use the DataFrame.toPandas() method.

Python3

# Convert the PySpark DataFrame to a
# Pandas DataFrame with toPandas();
# head() will show only the top 5
# rows of the dataset
df_spark2.toPandas().head()

Output:
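Keep in mind that toPandas() collects the entire DataFrame onto the driver, so it can run out of memory on large datasets. A safer sketch is to shrink the data first:

Python3
# toPandas() pulls every row to the driver;
# limit the rows before converting when the
# dataset is large
small_pdf = df_spark2.limit(5).toPandas()
print(small_pdf)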