How to create a PySpark DataFrame with a schema?
In this article, we will discuss how to create a DataFrame with a schema using PySpark. In short, a schema is the structure of a dataset or DataFrame.
Functions used:
Function | Description |
---|---|
SparkSession | The entry point to Spark SQL. |
SparkSession.builder | Gives access to the Builder API used to configure the session. |
builder.master("local") | Sets the Spark master URL so the session runs locally. |
builder.appName(name) | Sets the name of the application. |
builder.getOrCreate() | Creates a new SparkSession if none exists; otherwise returns the existing one. |
To create a DataFrame with a schema, we use:
Syntax: spark.createDataFrame(data, schema)
Parameters:
- data – list of values from which the DataFrame is created.
- schema – the structure of the dataset, or a list of column names.
where spark is the SparkSession object.
Example 1:
- In the code below, we create a new SparkSession object named "spark".
- We then create the data values and store them in a variable named "data".
- Next, we define the schema of the DataFrame and store it in a variable named "schm".
- We then create the DataFrame with the createDataFrame() function, passing it the data and the schema.
- Finally, we visualize the DataFrame with the show() function.
Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Geek_examples.com") \
        .getOrCreate()
    return spk

# main function
if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()

    # creating data for the dataframe
    data = [
        ("Shivansh", "M", 50000, 2),
        ("Vaishali", "F", 45000, 3),
        ("Karan", "M", 47000, 2),
        ("Satyam", "M", 40000, 4),
        ("Anupma", "F", 35000, 5)
    ]

    # giving the schema as a list of column names
    schm = ["Name of employee", "Gender", "Salary", "Years of experience"]

    # creating the dataframe using createDataFrame(),
    # passing the data and the schema
    df = spark.createDataFrame(data, schema=schm)

    # visualizing the dataframe using show()
    df.show()
Output:
Example 2:
In the code below, we create the DataFrame by passing the data and the schema directly to the createDataFrame() function.
Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Geek_examples.com") \
        .getOrCreate()
    return spk

# main function
if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()

    # creating the dataframe using createDataFrame(),
    # passing the data and the schema directly
    df = spark.createDataFrame([
        ("Mazda RX4", 21, 4, 4),
        ("Hornet 4 Drive", 22, 3, 2),
        ("Merc 240D", 25, 4, 2),
        ("Lotus Europa", 31, 5, 2),
        ("Ferrari Dino", 20, 5, 6),
        ("Volvo 142E", 22, 4, 2)
    ], ["Car Name", "mgp", "gear", "carb"])

    # visualizing the dataframe using show()
    df.show()
Output: