
📅  Last modified: 2022-05-13 01:55:12.999000             🧑  Author: Mango

How to change column types in a PySpark dataframe?

In this article, we will see how to change the column types of a PySpark dataframe.

Creating a dataframe for demonstration:

Python
# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()
  
# Create a spark dataframe
columns = ["Name", "Course_Name",
           "Duration_Months",
           "Course_Fees", "Start_Date",
           "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3,
     10000, "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills",
     2, 8000, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting",
     6, 15000, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12,
     60000, "02-12-2021", False),
]
course_df = spark.createDataFrame(data).toDF(*columns)
  
# View the dataframe
course_df.show()



Output:

+---------------+------------+---------------+-----------+----------+------------+
|           Name| Course_Name|Duration_Months|Course_Fees|Start_Date|Payment_Done|
+---------------+------------+---------------+-----------+----------+------------+
|    Amit Pathak|      Python|              3|      10000|02-07-2021|        true|
| Shikhar Mishra| Soft skills|              2|       8000|07-10-2021|       false|
|Shivani Suvarna|  Accounting|              6|      15000|20-08-2021|        true|
|     Pooja Jain|Data Science|             12|      60000|02-12-2021|       false|
+---------------+------------+---------------+-----------+----------+------------+

Let's view the schema of the dataframe:

Python

# View the column datatypes
course_df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: long (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

Method 1: Using DataFrame.withColumn()

DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column, or by replacing an existing column that has the same name.

We will use the Column.cast(dataType) method to convert a column to a different data type. Here, dataType is the data type you want to change the corresponding column to; it accepts either a pyspark.sql.types object or a type name string such as 'float'.

Example 1: Change the data type of a single column.

Python

# Cast Course_Fees from integer type to float type
course_df2 = course_df.withColumn("Course_Fees", 
                                  course_df["Course_Fees"]
                                  .cast('float'))
course_df2.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

In the example above, we can observe that the data type of the 'Course_Fees' column changed from long to float.

Example 2: Change the data type of multiple columns.

Python

# We can also make use of datatypes from
# pyspark.sql.types
from pyspark.sql.types import StringType, DateType, FloatType

course_df3 = course_df \
    .withColumn("Course_Fees",
                course_df["Course_Fees"]
                .cast(FloatType())) \
    .withColumn("Payment_Done",
                course_df["Payment_Done"]
                .cast(StringType())) \
    .withColumn("Start_Date",
                course_df["Start_Date"]
                .cast(DateType()))

course_df3.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: string (nullable = true)

In the example above, we changed the data types of the 'Course_Fees', 'Payment_Done', and 'Start_Date' columns to float, string, and date respectively.

Method 2: Using DataFrame.select()

Here we will use the select() function, which is used to select columns from a dataframe.

Example 1: Cast columns back to the original schema.

Let's convert 'course_df3' from the schema structure above back to the original schema.

Python

from pyspark.sql.types import StringType, BooleanType, IntegerType
  
course_df4 = course_df3.select(
    course_df3.Name,
    course_df3.Course_Name,
    course_df3.Duration_Months,
    (course_df3.Course_Fees.cast(IntegerType()))
    .alias('Course_Fees'),
    (course_df3.Start_Date.cast(StringType()))
    .alias('Start_Date'),
    (course_df3.Payment_Done.cast(BooleanType()))
    .alias('Payment_Done'),
)
  
course_df4.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: long (nullable = true)
 |-- Course_Fees: integer (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

Example 2: Change multiple columns to the same data type.

Python

# Changing datatype of all the columns
# to string type
from pyspark.sql.types import StringType
  
course_df5 = course_df.select(
    [course_df[c].cast(StringType())
     .alias(c) for c in course_df.columns]
)
course_df5.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: string (nullable = true)
 |-- Course_Fees: string (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Payment_Done: string (nullable = true)

Example 3: Change multiple columns to different data types.

Let's use 'course_df5', whose column types are all string, and change each column back to its corresponding type.

Python

from pyspark.sql.types import (
    StringType, BooleanType, IntegerType, FloatType, DateType
)
  
coltype_map = {
    "Name": StringType(),
    "Course_Name": StringType(),
    "Duration_Months": IntegerType(),
    "Course_Fees": FloatType(),
    "Start_Date": DateType(),
    "Payment_Done": BooleanType(),
}
  
# Cast each string column of course_df5
# back to the type mapped above
course_df6 = course_df5.select(
    [course_df5[c].cast(coltype_map[c])
     .alias(c) for c in course_df5.columns]
)
course_df6.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)

Method 3: Using spark.sql()

Here we will use a SQL query to change the column types.

Example: Using spark.sql()

Python

# course_df5 has all the column datatypes as string
course_df5.createOrReplaceTempView("course_view")
  
course_df7 = spark.sql('''
SELECT 
  Name,
  Course_Name,
  INT(Duration_Months),
  FLOAT(Course_Fees),
  DATE(Start_Date),
  BOOLEAN(Payment_Done)
FROM course_view
''')
  
course_df7.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Course_Name: string (nullable = true)
 |-- Duration_Months: integer (nullable = true)
 |-- Course_Fees: float (nullable = true)
 |-- Start_Date: date (nullable = true)
 |-- Payment_Done: boolean (nullable = true)