How to create a PySpark DataFrame with a schema?
In this article, we will discuss how to create a DataFrame with a schema using PySpark. In short, a schema is the structure of a dataset or DataFrame.
Functions used:
Function | Description |
---|---|
SparkSession | The entry point to Spark SQL. |
SparkSession.builder | Gives access to the Builder API used to configure the session. |
builder.master("local") | Sets the Spark master URL so the session runs locally. |
builder.appName(name) | Sets the name of the application. |
builder.getOrCreate() | Creates a new SparkSession if none exists; otherwise returns the existing one. |
To create a DataFrame with a schema, we use:
Syntax: spark.createDataFrame(data, schema)
Parameters:
- data – list of values from which the DataFrame is created.
- schema – the structure of the dataset, or a list of column names.
where spark is the SparkSession object.
Example 1:
- In the code below, we create a new SparkSession object named "spark".
- We then create the data values and store them in a variable named "data".
- Next, we define the schema of the DataFrame and store it in a variable named "schm".
- We then create the DataFrame with the createDataFrame() function, passing it the data and the schema.
- Finally, we visualize the DataFrame with the show() function.
Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Geek_examples.com") \
        .getOrCreate()
    return spk

# main function
if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()

    # creating data for the dataframe
    data = [
        ("Shivansh", "M", 50000, 2),
        ("Vaishali", "F", 45000, 3),
        ("Karan", "M", 47000, 2),
        ("Satyam", "M", 40000, 4),
        ("Anupma", "F", 35000, 5)
    ]

    # giving the schema as a list of column names
    schm = ["Name of employee", "Gender", "Salary", "Years of experience"]

    # creating the dataframe using createDataFrame(),
    # passing the data and the schema
    df = spark.createDataFrame(data, schema=schm)

    # visualizing the dataframe using show()
    df.show()
Output:
Example 2:
In the code below, we create the DataFrame by passing the data and the schema directly to the createDataFrame() function.
Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Geek_examples.com") \
        .getOrCreate()
    return spk

# main function
if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()

    # creating the dataframe using createDataFrame(),
    # passing the data and the schema directly
    df = spark.createDataFrame([
        ("Mazda RX4", 21, 4, 4),
        ("Hornet 4 Drive", 22, 3, 2),
        ("Merc 240D", 25, 4, 2),
        ("Lotus Europa", 31, 5, 2),
        ("Ferrari Dino", 20, 5, 6),
        ("Volvo 142E", 22, 4, 2)
    ], ["Car Name", "mgp", "gear", "carb"])

    # visualizing the dataframe using show()
    df.show()
Output: