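How to check for a substring in a PySpark dataframe?

In this article, we will see how to check for a substring in a PySpark dataframe.

A substring is a contiguous sequence of characters within a larger string. For example, "learning pyspark" is a substring of "I am learning pyspark from GeeksForGeeks". Let us look at different ways to extract substrings from one or more columns of a PySpark dataframe.

Creating a dataframe for demonstration: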

Python

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
  
# Column names for the dataframe
columns = ["LicenseNo", "ExpiryDate"]
  
# Row data for the dataframe
data = [
    ("MH201411094334", "2024-11-19"),
    ("AR202027563890", "2030-03-16"),
    ("UP202010345567", "2035-12-30"),
    ("KN201822347800", "2028-10-29"),
]
  
# Create the dataframe using the above values
reg_df = spark.createDataFrame(data=data,
                               schema=columns)
  
# View the dataframe
reg_df.show()

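In the dataframe above, LicenseNo is composed of three pieces of information: a 2-letter state code, a 4-digit registration year, and an 8-digit registration number.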
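
The layout described above determines the (pos, len) pairs used in the rest of this article. The following is a minimal pure-Python sketch, not Spark code, that derives those pairs from the field widths. The names FIELD_WIDTHS, field_positions, and substring_1_based are hypothetical helpers introduced only for illustration, and the 4-digit year width is read off the sample data.

```python
# Field layout of a LicenseNo value, as described in the text:
# 2-letter state code + 4-digit registration year + 8-digit registration number.
FIELD_WIDTHS = [("State", 2), ("RegYear", 4), ("RegID", 8)]


def field_positions(widths):
    """Return {name: (pos, length)} with 1-based start positions."""
    positions = {}
    pos = 1  # substring positions in Spark are counted from 1, not 0
    for name, width in widths:
        positions[name] = (pos, width)
        pos += width
    return positions


def substring_1_based(value, pos, length):
    """Pure-Python counterpart of a 1-based (pos, length) substring, for pos >= 1."""
    return value[pos - 1 : pos - 1 + length]


positions = field_positions(FIELD_WIDTHS)
print(positions)

sample = "MH201411094334"
parts = {
    name: substring_1_based(sample, pos, length)
    for name, (pos, length) in positions.items()
}
print(parts)
```

The resulting pairs (1, 2), (3, 4), and (7, 8) are the same arguments passed to substring() for the LicenseNo column in the examples of this article.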

Method 1: Using DataFrame.withColumn()

DataFrame.withColumn(colName, col) can be used to extract a substring from column data by combining it with PySpark's substring() function.

Syntax: DataFrame.withColumn(colName, col)

Parameters:

colName: str, name of the new column
col: Column, a column expression for the new column

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

We will use PySpark's substring() function to create a new column "State" by extracting the corresponding substring from the LicenseNo column.

Syntax: pyspark.sql.functions.substring(str, pos, len)

Example 1: Extracting a single substring column.

Python

from pyspark.sql.functions import substring
  
reg_df.withColumn(
  'State', substring('LicenseNo', 1, 2)
).show()

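Here, we created a new column "State" whose values are substrings taken from the "LicenseNo" column. The arguments (1, 2) mean that we extract 2 characters from "LicenseNo", starting at the first character. Positions are 1-based.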

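Example 2: Extracting multiple substring columns

Extract the state code as 'State', the registration year as 'RegYear', the registration ID as 'RegID', the expiry year as 'ExpYr', the expiry month as 'ExpMo', and the expiry day as 'ExpDt'.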

Python

from pyspark.sql.functions import substring
  
reg_df \
.withColumn('State'  , substring('LicenseNo' , 1, 2)) \
.withColumn('RegYear', substring('LicenseNo' , 3, 4)) \
.withColumn('RegID'  , substring('LicenseNo' , 7, 8)) \
.withColumn('ExpYr'  , substring('ExpiryDate', 1, 4)) \
.withColumn('ExpMo'  , substring('ExpiryDate', 6, 2)) \
.withColumn('ExpDt'  , substring('ExpiryDate', 9, 2)) \
.show()

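The code above shows how to chain withColumn() calls to obtain several substring columns. Each withColumn() call adds one new column to the dataframe. Note that the original columns are retained as well.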

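Method 2: Using substr instead of substring

Alternatively, we can use the substr() method of the Column type instead of the substring() function.

Syntax: pyspark.sql.Column.substr(startPos, length)

Returns a Column that is a substring of the column. It starts at 'startPos' and has length 'length' when the column is of String type; when the column is of Binary type, it is the slice of the byte array that starts at 'startPos' (in bytes) and has length 'length'.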

Example: Using substr

Python

from pyspark.sql.functions import col
  
reg_df \
.withColumn('State'  , col('LicenseNo' ).substr(1, 2)) \
.withColumn('RegYear', col('LicenseNo' ).substr(3, 4)) \
.withColumn('RegID'  , col('LicenseNo' ).substr(7, 8)) \
.withColumn('ExpYr'  , col('ExpiryDate').substr(1, 4)) \
.withColumn('ExpMo'  , col('ExpiryDate').substr(6, 2)) \
.withColumn('ExpDt'  , col('ExpiryDate').substr(9, 2)) \
.show()

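The substr() method is called on a Column object, obtained here with the col() function from the pyspark.sql.functions module. The change is essentially syntactic; the positioning logic stays the same.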

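Method 3: Using DataFrame.select()

Here, we will use the select() function to extract substrings from the dataframe. Unlike withColumn(), select() returns only the columns listed, so the original columns are dropped unless they are selected explicitly.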

Syntax: pyspark.sql.DataFrame.select(*cols)

Example: Using DataFrame.select()

Python

from pyspark.sql.functions import substring
  
reg_df.select(
  substring('LicenseNo' , 1, 2).alias('State')  ,
  substring('LicenseNo' , 3, 4).alias('RegYear'),
  substring('LicenseNo' , 7, 8).alias('RegID')  ,
  substring('ExpiryDate', 1, 4).alias('ExpYr')  ,
  substring('ExpiryDate', 6, 2).alias('ExpMo')  ,
  substring('ExpiryDate', 9, 2).alias('ExpDt')  ,
).show()

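Method 4: Using spark.sql()

The spark.sql() method runs a relational SQL query inside Spark and returns the result as a DataFrame. To query a dataframe this way, first register it as a temporary view.

Syntax: spark.sql(sqlQuery)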

Example: Using spark.sql()

Python

reg_df.createOrReplaceTempView("reg_view")
  
reg_df2 = spark.sql('''
SELECT 
  SUBSTR(LicenseNo, 1, 2)  AS State,
  SUBSTR(LicenseNo, 3, 4)  AS RegYear,
  SUBSTR(LicenseNo, 7, 8)  AS RegID,
  SUBSTR(ExpiryDate, 1, 4) AS ExpYr,
  SUBSTR(ExpiryDate, 6, 2) AS ExpMo,
  SUBSTR(ExpiryDate, 9, 2) AS ExpDt
FROM reg_view
''')
  
reg_df2.show()

Here, the expression passed to spark.sql() is a relational SQL query. The same query can also be run in a SQL query editor to get the corresponding output.

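Method 5: Using DataFrame.selectExpr()

The selectExpr() method also accepts SQL, but it takes SQL expressions rather than a complete relational query. We can pass one or more SQL expressions, each as a string, and it returns a new DataFrame.

Syntax: pyspark.sql.DataFrame.selectExpr(*expr)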

Example: Using DataFrame.selectExpr().

Python

reg_df.selectExpr(
  'LicenseNo',
  'ExpiryDate',
  'substring(LicenseNo , 1, 2) AS State'   ,
  'substring(LicenseNo , 3, 4) AS RegYear' ,
  'substring(LicenseNo , 7, 8) AS RegID'   ,
  'substring(ExpiryDate, 1, 4) AS ExpYr'   ,
  'substring(ExpiryDate, 6, 2) AS ExpMo'   ,
  'substring(ExpiryDate, 9, 2) AS ExpDt'   ,
).show()

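In the code snippet above, we pass multiple SQL expressions to selectExpr(). Each expression resembles one item of the SELECT list in the relational SQL query we wrote earlier. We also retain the original columns by listing them explicitly.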