Split a Single Column into Multiple Columns in a PySpark DataFrame
pyspark.sql.functions provides a split() function that is used to split a DataFrame string Column into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: a Column or str (the column name) to split.
- pattern: a str parameter, a string representing a regular expression. This should be a Java regular expression.
- limit: an optional int parameter; when specified, it controls the number of times the pattern is applied (see the sketch after this list).
- limit > 0: the length of the resulting array will be at most limit.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any size.
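To make the limit behavior concrete, here is a minimal sketch; it assumes Spark 3.0+ (where the limit argument of split() is available) and a throwaway SparkSession, and the date value is only an illustration:
Python3
# demonstrating the effect of limit on split()
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("splitlimit").getOrCreate()
df = spark.createDataFrame([('1991-04-01',)], ['DOB'])

# limit=-1 (default): every '-' is used -> ['1991', '04', '01']
# limit=2: the pattern is applied at most once -> ['1991', '04-01']
df.select(split(df['DOB'], '-').alias('no_limit'),
          split(df['DOB'], '-', 2).alias('limit_2')).show(truncate=False)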
First, let us create a DataFrame.
Python3
# installing pyspark
!pip install pyspark
# importing pyspark
import pyspark
# importing SparkSession
from pyspark.sql import SparkSession
# importing all from pyspark.sql.functions
from pyspark.sql.functions import *
# creating SparkSession object
spark = SparkSession.builder.appName("sparkdf").getOrCreate()
# creating the row data for dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
('Milan', 'Sharma', '', '2000-05-19'),
('Rohit', 'Verma', 'M', '1978-09-05'),
('Maria', 'Anne', 'F', '1967-12-01'),
('Jay', 'Mehta', 'M', '1980-02-17')
]
# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']
# creating the dataframe df
df = spark.createDataFrame(data, columns)
# printing dataframe schema
df.printSchema()
# show dataframe
df.show()
Output:
Example 1: Split a column using withColumn()
In this example, we created a simple DataFrame with a date-of-birth column 'DOB' in yyyy-mm-dd string format. Using split() together with withColumn(), the column is split into Year, Month, and Day columns.
Python3
# split() function defining parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')
# Now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
.withColumn('Month', split_cols.getItem(1)) \
.withColumn('Day', split_cols.getItem(2))
# show df
df1.show()
Output:
Alternatively, we can also write it as follows, which gives the same output:
Python3
# defining split() along with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
.withColumn('Month', split(df['DOB'], '-').getItem(1)) \
.withColumn('Day', split(df['DOB'], '-').getItem(2))
# show df2
df2.show()
Output:
In the example above, we used two parameters of split(): 'str', which receives the column to split, and 'pattern', the delimiter present in that column's data (here, the '-' character) around which the data is split.
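Note also that pattern is a Java regular expression, so a delimiter that happens to be a regex metacharacter must be escaped. A brief sketch (the dot-separated date column here is hypothetical, not part of the article's data):
Python3
# '.' alone is a regex wildcard, so a literal dot must be escaped as '\\.'
df_dots = spark.createDataFrame([('1991.04.01',)], ['DOB'])
df_dots.select(split(df_dots['DOB'], '\\.').getItem(0).alias('Year'),
               split(df_dots['DOB'], '\\.').getItem(1).alias('Month'),
               split(df_dots['DOB'], '\\.').getItem(2).alias('Day')).show()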
Example 2: Split a column using select()
In this example, we recreate the same DataFrame df and split its 'DOB' column using .select():
Python3
# creating the row data for dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
('Milan', 'Sharma', '', '2000-05-19'),
('Rohit', 'Verma', 'M', '1978-09-05'),
('Maria', 'Anne', 'F', '1967-12-01'),
('Jay', 'Mehta', 'M', '1980-02-17')
]
# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']
# creating the dataframe df
df = spark.createDataFrame(data, columns)
# printing dataframe schema
df.printSchema()
# show dataframe
df.show()
# defining split()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')
# applying split() using select()
df3 = df.select('First Name', 'Last Name', 'DOB',
split_cols.getItem(0).alias('year'),
split_cols.getItem(1).alias('month'),
split_cols.getItem(2).alias('day'))
# show df3
df3.show()
Output:
In the example above, we did not include the 'Gender' column in select(), so it is not visible in the resulting df3.
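If the goal is instead to keep every existing column plus the new ones, an alternative sketch (not from the original example; the name df4 is just for illustration) passes '*' to select() instead of listing each column:
Python3
# keeping all original columns and appending the split pieces
df4 = df.select('*',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))
df4.show()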
Example 3: Splitting another string column
Python3
# creating the row data for dataframe
data = [('Jaya', 'Sinha'), ('Milan', 'Soni'),
('Rohit', 'Verma'), ('Maria', 'Anne'),
('Jay', 'Mehta')]
# giving the column names for the dataframe
columns = ['First Name', 'Last Name']
# creating the dataframe df
df = spark.createDataFrame(data, columns)
# printing dataframe schema
df.printSchema()
# show dataframe
df.show()
# defining split()
split_cols = pyspark.sql.functions.split(df['Last Name'], '')
# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
.withColumn('2', split_cols.getItem(1)) \
.withColumn('3', split_cols.getItem(2)) \
.withColumn('4', split_cols.getItem(3)) \
.withColumn('5', split_cols.getItem(4))
# show df
df.show()
Output:
In the example above, we took only two columns, First Name and Last Name, and split the Last Name column's values into single characters spread across multiple columns.
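Because last names vary in length, the fixed getItem() indices above leave empty cells for shorter names. If one row per character is acceptable, an alternative sketch uses posexplode() on the split array (note that splitting on the empty pattern may also produce a trailing empty string, depending on the Spark version):
Python3
# one row per character instead of one column per character
from pyspark.sql.functions import posexplode
df.select('First Name', 'Last Name',
          posexplode(split(df['Last Name'], '')).alias('pos', 'char')).show()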