Convert comma separated string to array in PySpark DataFrame
In this article, we will learn how to convert a comma-separated string into an array in a PySpark DataFrame.
In PySpark SQL, the split() function converts a delimiter-separated string into an array. It does this by splitting the string on the delimiter (such as a space or a comma) and stacking the resulting pieces into an array. The function returns a pyspark.sql.Column of array type.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: the string column to be split.
- pattern: the delimiter (a regular expression) used to split the string.
- limit: an integer that controls the number of times the pattern is applied; the default of -1 means no limit (see the sketch just after this list).
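As a quick illustration of the limit parameter, here is a minimal sketch; the column name and data are made up for the demo, and it assumes Spark 3.x, where the Python API exposes the limit argument:
Python3
# demonstrate how `limit` caps the size of the resulting array
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName('split-demo').getOrCreate()

# a hypothetical one-column dataframe used only for this demo
df = spark.createDataFrame([("a,b,c,d",)], ["csv"])

df.select(split(col("csv"), ",").alias("no_limit"),    # -> [a, b, c, d]
          split(col("csv"), ",", 2).alias("limit_2")   # -> [a, b,c,d]
          ).show()

spark.stop()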
Examples
Let's walk through a few examples to understand how the code works.
Example 1: Working with string values
Let's look at an example to see the split function in action. For this example, we create a custom DataFrame and use split to turn each student's name into an array of its comma-separated parts. Here the split is applied to a column of string type.
Python3
# import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# start the spark session
spark = SparkSession.builder \
    .appName('GeeksforGeeks') \
    .getOrCreate()

# create the dataframe
data = [("Pulkit, Dhingra", "M", 70),
        ("Ritika, Pandey", "F", 85),
        ("Kaif, Ali", "M", 63),
        ("Asha, Deep", "F", 62)]
columns = ["Name", "Gender", "Marks"]
df = spark.createDataFrame(data, columns)

# split the Name column into an array of name parts
df2 = df.select(split(col("Name"), ",").alias("Name_Arr"),
                col("Gender"), col("Marks"))
df2.show()

# stop the session
spark.stop()
Output:
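With the data above, df2.show() should render roughly as follows (exact column widths may vary):

+------------------+------+-----+
|          Name_Arr|Gender|Marks|
+------------------+------+-----+
|[Pulkit,  Dhingra]|     M|   70|
| [Ritika,  Pandey]|     F|   85|
|      [Kaif,  Ali]|     M|   63|
|     [Asha,  Deep]|     F|   62|
+------------------+------+-----+

Note the leading space inside each second element: split keeps the space that followed the comma in the original string.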
Example 2: Working with integer values
If we want numeric types, we can combine split() with the cast() function. In this example we use cast(ArrayType(IntegerType())), which explicitly casts the split result to an array of integers.
Python3
# import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
from pyspark.sql.types import ArrayType, IntegerType

# start the spark session
spark = SparkSession.builder \
    .appName('GeeksforGeeks') \
    .getOrCreate()

# create the dataframe; Marks is a comma separated string
data = [("Pulkit, Dhingra", "M", "70,85"),
        ("Ritika, Pandey", "F", "85,95"),
        ("Kaif, Ali", "M", "63,72"),
        ("Asha, Deep", "F", "62,92")]
columns = ["Name", "Gender", "Marks"]
df = spark.createDataFrame(data, columns)
df.show()

# split Marks and cast the result to an array of integers
df2 = df.select(col("Name"), col("Gender"),
                split(col("Marks"), ",").cast(
                    ArrayType(IntegerType())).alias("Marks_Arr"))
df2.show()

# stop the session
spark.stop()
Output:
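The first df.show() prints the raw data, and df2.show() then shows Marks converted to an array of integers, roughly:

+---------------+------+-----+
|           Name|Gender|Marks|
+---------------+------+-----+
|Pulkit, Dhingra|     M|70,85|
| Ritika, Pandey|     F|85,95|
|      Kaif, Ali|     M|63,72|
|     Asha, Deep|     F|62,92|
+---------------+------+-----+

+---------------+------+---------+
|           Name|Gender|Marks_Arr|
+---------------+------+---------+
|Pulkit, Dhingra|     M| [70, 85]|
| Ritika, Pandey|     F| [85, 95]|
|      Kaif, Ali|     M| [63, 72]|
|     Asha, Deep|     F| [62, 92]|
+---------------+------+---------+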
Example 3: Working with both integer and string values
There may be a situation where we need to check every column and, wherever a comma-separated value is present, split it. The split() function has another advantage here: when the delimiter does not occur in a column value, split() simply produces a single-element array instead of raising an exception. This can come in handy.
Python3
# import required modules
import findspark
findspark.init('c:/spark')

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# helper that splits any comma separated column into an array
def return_array(column):
    return split(col(column), ",")

# start the spark session
spark = SparkSession.builder \
    .appName('GeeksforGeeks') \
    .getOrCreate()

# create the dataframe
data = [("Pulkit, Dhingra", "M", "70,85"),
        ("Ritika, Pandey", "F", "85,95"),
        ("Kaif, Ali", "M", "63,72"),
        ("Asha, Deep", "F", "62,92")]
columns = ["Name", "Gender", "Marks"]
df = spark.createDataFrame(data, columns)
df.show()

# apply the split to every column; Gender has no comma, so it
# becomes a single-element array instead of raising an error
df2 = df.select(return_array("Name").alias("Name"),
                return_array("Gender").alias("Gender"),
                return_array("Marks").alias("Marks_Arr"))
df2.show()

# stop the session
spark.stop()
Output:
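The first table (df.show()) is the same raw data as in Example 2; df2.show() then shows every column split into an array. Gender contains no comma, so each value becomes a single-element array rather than raising an error, roughly:

+------------------+------+---------+
|              Name|Gender|Marks_Arr|
+------------------+------+---------+
|[Pulkit,  Dhingra]|   [M]| [70, 85]|
| [Ritika,  Pandey]|   [F]| [85, 95]|
|      [Kaif,  Ali]|   [M]| [63, 72]|
|     [Asha,  Deep]|   [F]| [62, 92]|
+------------------+------+---------+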