
📅  Last modified: 2022-05-13 01:55:40.939000             🧑  Author: Mango

Converting a Comma-Separated String to an Array in a PySpark DataFrame

In this article, we will learn how to convert a comma-separated string into an array in a PySpark DataFrame.

In PySpark SQL, the split() function converts a delimiter-separated string into an array. It does this by splitting the string on a delimiter (such as a space or a comma) and stacking the pieces into an array. The function returns a pyspark.sql.Column of array type.

Examples

Let's look at a few examples to understand how the code works.

Example 1: Working with string values

Let's look at an example to see the split function in action. For this example, we created a custom DataFrame and used the split function to turn the students' names into arrays of name parts. Here, we apply split to a column that holds string data.



Python3
# import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

# start the spark session
spark = SparkSession.builder \
    .appName('GeeksforGeeks') \
    .getOrCreate()

# create the dataframe
data = [("Pulkit, Dhingra", "M", 70),
        ("Ritika, Pandey", "F", 85),
        ("Kaif, Ali", "M", 63),
        ("Asha, Deep", "F", 62)]

columns = ["Name", "Gender", "Marks"]
df = spark.createDataFrame(data, columns)

# split the Name column on "," into an array column;
# selecting only the new alias drops the original Name column
df2 = df.select(split(col("Name"), ",").alias("Name_Arr"),
                col("Gender"), col("Marks"))

df2.show()

# stop session
spark.stop()




Output:

Example 2: Working with integer values

If we want to convert to a numeric type, we can combine the cast() function with the split() function. In this example we want an array of integers, so we use cast(ArrayType(IntegerType())), which explicitly specifies that the result should be cast to an array of integer type.

Python3

# import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
from pyspark.sql.types import ArrayType, IntegerType

# start the spark session
spark = SparkSession.builder \
    .appName('GeeksforGeeks') \
    .getOrCreate()

# create the dataframe
data = [("Pulkit, Dhingra", "M", "70,85"),
        ("Ritika, Pandey", "F", "85,95"),
        ("Kaif, Ali", "M", "63,72"),
        ("Asha, Deep", "F", "62,92")]

columns = ["Name", "Gender", "Marks"]
df = spark.createDataFrame(data, columns)
df.show()

# split the Marks column and cast the result
# to an array of integers
df2 = df.select(col("Name"), col("Gender"),
                split(col("Marks"), ",").cast(
                    ArrayType(IntegerType())).alias("Marks_Arr"))

df2.show()

# stop session
spark.stop()

Output:



Example 3: Working with both integer and string values

There may be a situation where we need to check every column and split it if it holds comma-separated values. The split() function has another advantage here: the delimiter may not be present in a column at all. Instead of raising an exception, split() handles this case by producing a single-element array containing the column value, which can come in handy.

Python3

# optional: locate a local Spark install; adjust the path to
# your environment, or skip this if pyspark is already importable
import findspark
findspark.init('c:/spark')

# import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
from pyspark.sql.types import ArrayType, IntegerType

# helper that splits any column on ","
def return_array(column):
    return split(col(column), ",")

# start the spark session
spark = SparkSession.builder \
    .appName('GeeksforGeeks') \
    .getOrCreate()

# create the dataframe
data = [("Pulkit, Dhingra", "M", "70,85"),
        ("Ritika, Pandey", "F", "85,95"),
        ("Kaif, Ali", "M", "63,72"),
        ("Asha, Deep", "F", "62,92")]

columns = ["Name", "Gender", "Marks"]
df = spark.createDataFrame(data, columns)
df.show()

# apply the helper to every column; Gender has no comma,
# so split simply wraps it in a one-element array
df2 = df.select(return_array("Name").alias("Name"),
                return_array("Gender").alias("Gender"),
                return_array("Marks").alias("Marks_Arr"))
df2.show()

# stop session
spark.stop()

Output: