📜  在 PySpark DataFrame 中将单列拆分为多列

📅  最后修改于: 2022-05-13 01:54:59.562000             🧑  作者: Mango

在 PySpark DataFrame 中将单列拆分为多列

pyspark.sql.functions提供了一个函数split() ,用于将 DataFrame字符串Column 拆分为多列。

首先让我们创建一个DataFrame。

Python3
# installing pyspark
!pip install pyspark
  
# importing pyspark
import pyspark
  
# importing SparkSession
from pyspark.sql import SparkSession
  
# importing all from pyspark.sql.function
from pyspark.sql.functions import *
  
# creating SparkSession object
spark=SparkSession.builder.appName("sparkdf").getOrCreate()
  
  
# creating the row data for dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')
        ]
  
# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']
  
# creating the dataframe df
df = spark.createDataFrame(data, columns)
  
# printing dataframe schema
df.printSchema()
  
# show dataframe
df.show()


Python3
# split() function defining parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')
  
# Now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
    .withColumn('Month', split_cols.getItem(1)) \
    .withColumn('Day', split_cols.getItem(2))
  
# show df
df1.show()


Python3
# defining split() along with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
    .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
    .withColumn('Day', split(df['DOB'], '-').getItem(2))
  
# show df2
df2.show()


Python3
# creating the row data for dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')
        ]
  
# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'DOB']
  
# creating the dataframe df
df = spark.createDataFrame(data, columns)
  
# printing dataframe schema
df.printSchema()
  
# show dataframe
df.show()
  
# defining split ()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')
  
# applying split() using select()
df3 = df.select('First Name', 'Last Name', 'Gender', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))
  
# show df3
df3.show()


Python3
# creating the row data for dataframe
data = [('Jaya', 'Sinha'), ('Milan', 'Soni'),
        ('Rohit', 'Verma'), ('Maria', 'Anne'), 
        ('Jay', 'Mehta')]
  
# giving the column names for the dataframe
columns = ['First Name', 'Last Name']
  
# creating the dataframe df
df = spark.createDataFrame(data, columns)
  
# printing dataframe schema
df.printSchema()
  
# show dataframe
df.show()
  
# defining split()
split_cols = pyspark.sql.functions.split(df['Last Name'], '')
  
# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))
  
# show df
df.show()


输出:

数据框已创建

示例 1:使用 withColumn() 拆分列

在此示例中,我们创建了一个简单的数据框,其中包含 yyyy-mm-dd字符串格式的出生日期列“DOB”。使用 split 和withColumn()将列拆分为年、月和日期列。

蟒蛇3

# split() function defining parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')
  
# Now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
    .withColumn('Month', split_cols.getItem(1)) \
    .withColumn('Day', split_cols.getItem(2))
  
# show df
df1.show()

输出:

拆分列后的数据框

或者,我们也可以这样写,它会给出相同的输出:

蟒蛇3

# defining split() along with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
    .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
    .withColumn('Day', split(df['DOB'], '-').getItem(2))
  
# show df2
df2.show()

输出:



在上面的示例中,我们使用了 split() 的 2 个参数,即包含列名的“str”,“pattern”包含该列中存在的数据的模式类型,并从该位置拆分数据。

示例 2:使用 select() 拆分列

在此示例中,我们将使用相同的 DataFrame df 并使用 .select() 拆分其“DOB”列:

蟒蛇3

# creating the row data for dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', '', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')
        ]
  
# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'DOB']
  
# creating the dataframe df
df = spark.createDataFrame(data, columns)
  
# printing dataframe schema
df.printSchema()
  
# show dataframe
df.show()
  
# defining split ()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')
  
# applying split() using select()
df3 = df.select('First Name', 'Last Name', 'Gender', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))
  
# show df3
df3.show()

输出:

在上面的例子中,我们没有在 select() 中选择 'Gender' 列,所以它在结果 df3 中不可见。

示例 3:拆分另一个字符串列

蟒蛇3

# creating the row data for dataframe
data = [('Jaya', 'Sinha'), ('Milan', 'Soni'),
        ('Rohit', 'Verma'), ('Maria', 'Anne'), 
        ('Jay', 'Mehta')]
  
# giving the column names for the dataframe
columns = ['First Name', 'Last Name']
  
# creating the dataframe df
df = spark.createDataFrame(data, columns)
  
# printing dataframe schema
df.printSchema()
  
# show dataframe
df.show()
  
# defining split()
split_cols = pyspark.sql.functions.split(df['Last Name'], '')
  
# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))
  
# show df
df.show()

输出:

在上面的示例中,我们只采用了 First Name 和 Last Name 两列,并将 Last Name 列值拆分为多个列中的单个字符。