How to Change Column Type in a PySpark Dataframe?
In this article, we will see how to change the column types of a PySpark dataframe.
Creating a dataframe for demonstration:
Python
# Create a spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkExamples').getOrCreate()

# Create a spark dataframe
columns = ["Name", "Course_Name", "Duration_Months",
           "Course_Fees", "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3, 10000, "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills", 2, 8000, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6, 15000, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12, 60000, "02-12-2021", False),
]
course_df = spark.createDataFrame(data).toDF(*columns)

# View the dataframe
course_df.show()
Output:
Let's view the schema of the dataframe:
Python
# View the column datatypes
course_df.printSchema()
Output:
Method 1: Using DataFrame.withColumn()
DataFrame.withColumn(colName, col) returns a new DataFrame by adding a column, or by replacing an existing column that has the same name.
To convert a column, we call the cast(dataType) method on it, where dataType is the type you want to change the column to; it can be given either as a type string such as 'float' or as a type object from pyspark.sql.types.
Example 1: Change the datatype of a single column.
Python
# Cast Course_Fees from long type to float type
course_df2 = course_df.withColumn(
    "Course_Fees", course_df["Course_Fees"].cast('float'))
course_df2.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: boolean (nullable = true)
In the above example, we can observe that the datatype of the 'Course_Fees' column changed from long to float.
Example 2: Change the datatype of multiple columns.
Python
# We can also make use of datatypes from
# pyspark.sql.types
from pyspark.sql.types import StringType, DateType, FloatType

course_df3 = course_df \
    .withColumn("Course_Fees",
                course_df["Course_Fees"].cast(FloatType())) \
    .withColumn("Payment_Done",
                course_df["Payment_Done"].cast(StringType())) \
    .withColumn("Start_Date",
                course_df["Start_Date"].cast(DateType()))
course_df3.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: date (nullable = true)
|-- Payment_Done: string (nullable = true)
In the above example, we changed the datatypes of the 'Course_Fees', 'Payment_Done', and 'Start_Date' columns to float, string, and date respectively.
Method 2: Using DataFrame.select()
Here we will use the select() function, which is used to select columns from a dataframe.
Syntax: dataframe.select(columns)
Where dataframe is the input dataframe and columns are the input columns
Example 1: Change a single column.
Let's convert 'course_df3' from the above schema structure back to the original schema.
Python
from pyspark.sql.types import StringType, BooleanType, IntegerType

course_df4 = course_df3.select(
    course_df3.Name,
    course_df3.Course_Name,
    course_df3.Duration_Months,
    course_df3.Course_Fees.cast(IntegerType()).alias('Course_Fees'),
    course_df3.Start_Date.cast(StringType()).alias('Start_Date'),
    course_df3.Payment_Done.cast(BooleanType()).alias('Payment_Done'),
)
course_df4.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: long (nullable = true)
|-- Course_Fees: integer (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: boolean (nullable = true)
Example 2: Change multiple columns to the same datatype.
Python
# Changing datatype of all the columns
# to string type
from pyspark.sql.types import StringType

course_df5 = course_df.select(
    [course_df[c].cast(StringType()).alias(c)
     for c in course_df.columns]
)
course_df5.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: string (nullable = true)
|-- Course_Fees: string (nullable = true)
|-- Start_Date: string (nullable = true)
|-- Payment_Done: string (nullable = true)
Example 3: Change multiple columns to different datatypes.
Let's use 'course_df5', all of whose columns are of type 'string', and change each column back to its appropriate type.
Python
from pyspark.sql.types import (
    StringType, BooleanType, IntegerType, FloatType, DateType
)

coltype_map = {
    "Name": StringType(),
    "Course_Name": StringType(),
    "Duration_Months": IntegerType(),
    "Course_Fees": FloatType(),
    "Start_Date": DateType(),
    "Payment_Done": BooleanType(),
}

# course_df5 has all its columns as string;
# cast each one to the type given in coltype_map
course_df6 = course_df5.select(
    [course_df5[c].cast(coltype_map[c]).alias(c)
     for c in course_df5.columns]
)
course_df6.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: integer (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: date (nullable = true)
|-- Payment_Done: boolean (nullable = true)
Method 3: Using spark.sql()
Here we will use a SQL query to change the column types.
Syntax: spark.sql("sql query")
Example: Using spark.sql()
Python
# course_df5 has all the column datatypes as string
course_df5.createOrReplaceTempView("course_view")

course_df7 = spark.sql('''
    SELECT
        Name,
        Course_Name,
        INT(Duration_Months),
        FLOAT(Course_Fees),
        DATE(Start_Date),
        BOOLEAN(Payment_Done)
    FROM course_view
''')
course_df7.printSchema()
Output:
root
|-- Name: string (nullable = true)
|-- Course_Name: string (nullable = true)
|-- Duration_Months: integer (nullable = true)
|-- Course_Fees: float (nullable = true)
|-- Start_Date: date (nullable = true)
|-- Payment_Done: boolean (nullable = true)