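PySpark DataFrame - Map Strings to Numeric
In this article, we will see how to map the string values in a PySpark DataFrame column to numeric values.
Creating a DataFrame for demonstration:
Here we create one Row per college name, pass the rows to the createDataFrame() method, and then display the DataFrame.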
Python3
# importing module
import pyspark

# importing SparkSession and Row from the pyspark.sql module
from pyspark.sql import SparkSession, Row

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of college data, one Row per college name
dataframe = spark.createDataFrame([Row("vignan"),
                                   Row("rvrjc"),
                                   Row("klu"),
                                   Row("rvrjc"),
                                   Row("klu"),
                                   Row("vignan"),
                                   Row("iit")],
                                  ["college"])

# display dataframe
dataframe.show()
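Output:
+-------+
|college|
+-------+
| vignan|
|  rvrjc|
|    klu|
|  rvrjc|
|    klu|
| vignan|
|    iit|
+-------+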
Method 1: Using the map() function
Here we create a function that converts a string to a number and call it on every row through a lambda expression.
Syntax: dataframe.select("string_column_name").rdd.map(lambda x: string_to_numeric(x[0])).map(lambda x: Row(x)).toDF(["numeric_column_name"]).show()
where,
- dataframe is the PySpark DataFrame
- string_column_name is the actual column to be mapped to numeric_column_name
- string_to_numeric is the function that returns the numeric value for a string
- the lambda expression calls the function so that the numeric value is returned
Here we take the college DataFrame created above with Row, map each college name to a numeric value with a lambda function, and name the resulting column college_number. To do this, we define a function that checks the college name: it returns 1 if the college is iit, 2 if it is vignan, 3 if it is rvrjc, and 4 for any other college.
Python3
# function that converts a college name to a numeric value
def string_to_numeric(x):
    # return numeric value 1 if college is iit
    if x == 'iit':
        return 1
    # return numeric value 2 if college is vignan
    elif x == "vignan":
        return 2
    # return numeric value 3 if college is rvrjc
    elif x == "rvrjc":
        return 3
    # return numeric value 4 if college
    # is other than the above three
    else:
        return 4

# map each row to its numeric value with a lambda function,
# wrap the result in a Row, and name the column college_number
# (the parentheses allow the method chain to span several lines)
(dataframe.select("college")
    .rdd.map(lambda x: string_to_numeric(x[0]))
    .map(lambda x: Row(x))
    .toDF(["college_number"])
    .show())
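Output:
+--------------+
|college_number|
+--------------+
|             2|
|             3|
|             4|
|             3|
|             4|
|             2|
|             1|
+--------------+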
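As the number of colleges grows, the if/elif chain in string_to_numeric becomes long. A minimal sketch of the same mapping as a dictionary lookup is given here; the names COLLEGE_CODES and DEFAULT_CODE are hypothetical and not part of the original example, and the snippet uses plain Python lists so that it runs without a Spark session.

```python
# Hypothetical lookup table holding the same codes as the if/elif chain
COLLEGE_CODES = {"iit": 1, "vignan": 2, "rvrjc": 3}
# Hypothetical default for any college that is not in the table
DEFAULT_CODE = 4


def string_to_numeric(x):
    """Return the numeric code for a college name, or DEFAULT_CODE if unknown."""
    return COLLEGE_CODES.get(x, DEFAULT_CODE)


# same college names as in the demonstration DataFrame
colleges = ["vignan", "rvrjc", "klu", "rvrjc", "klu", "vignan", "iit"]
codes = [string_to_numeric(name) for name in colleges]
print(codes)  # [2, 3, 4, 3, 4, 2, 1]
```

Because this version keeps the name and signature of string_to_numeric, it should be usable in the rdd.map() call of Method 1 without further changes.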
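Method 2: Using the withColumn() method
Here we use the withColumn() method to add a new column whose value is computed from the string column with when() and otherwise().
Syntax: dataframe.withColumn("numeric_column", when(col("string_column") == 'string_value', 1).otherwise(value))
Where
- dataframe is the PySpark DataFrame
- numeric_column is the name of the new column that holds the numeric values
- string_column is the column to be mapped to numeric
- value is the numeric value returned when no condition matches
Example: Here we take the college DataFrame created above with Row and map each college name to a college number using the withColumn() method together with when().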
Python3
# import the col and when functions
from pyspark.sql.functions import col, when

# map college name to college number
# using the withColumn() method along with when()
dataframe.withColumn("college_number",
                     when(col("college") == 'iit', 1)
                     .when(col("college") == 'vignan', 2)
                     .when(col("college") == 'rvrjc', 3)
                     .otherwise(4)).show()
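Output:
+-------+--------------+
|college|college_number|
+-------+--------------+
| vignan|             2|
|  rvrjc|             3|
|    klu|             4|
|  rvrjc|             3|
|    klu|             4|
| vignan|             2|
|    iit|             1|
+-------+--------------+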
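Writing one when() call per value by hand gets tedious for larger mappings. One possible variation, not used in the example above, is to generate an equivalent Spark SQL CASE expression from a dictionary. The sketch below is only an illustration: the helper build_case_expression is hypothetical, and it assumes the column name and the string values are trusted and contain no single quotes.

```python
def build_case_expression(column, mapping, default):
    """Build a Spark SQL CASE expression string from a dict of string -> number.

    Hypothetical helper: assumes the column name and the keys of `mapping`
    are trusted and contain no single quotes (no escaping is performed).
    """
    branches = " ".join(
        f"WHEN {column} = '{value}' THEN {number}"
        for value, number in mapping.items()
    )
    return f"CASE {branches} ELSE {default} END"


# same mapping as the when()/otherwise() chain
case_sql = build_case_expression("college", {"iit": 1, "vignan": 2, "rvrjc": 3}, 4)
print(case_sql)
# CASE WHEN college = 'iit' THEN 1 WHEN college = 'vignan' THEN 2 WHEN college = 'rvrjc' THEN 3 ELSE 4 END
```

The resulting string can then be passed to expr() from pyspark.sql.functions, for example dataframe.withColumn("college_number", expr(case_sql)).show(), which should give the same mapping as the when() chain.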