Selecting Only Numeric or String Column Names from a PySpark DataFrame


📅 Last modified: 2023-12-03 15:36:14.823000 🧑 Author: Mango

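When processing data with PySpark, you sometimes need to keep only the numeric columns or only the string columns of a DataFrame. The need may come from a downstream step that accepts a single kind of column, or from wanting to drop unneeded columns so that later computation has less data to handle.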

The sections below show how to select numeric columns and string columns.

Selecting numeric columns

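In PySpark every column has a data type. The dtypes property reports the types as (name, type string) pairs such as ("age", "int"), while df.schema exposes the actual DataType objects. To select only the numeric columns, keep the fields whose type is DoubleType, FloatType, IntegerType, or LongType. Comparing the strings from dtypes against DataType instances never matches, so the function below checks df.schema instead.

from pyspark.sql.types import DoubleType, FloatType, IntegerType, LongType

def select_numeric_columns(df):
    # DataType classes treated as numeric here; extend the tuple if needed.
    numeric_types = (DoubleType, FloatType, IntegerType, LongType)
    # df.schema.fields holds StructField objects with .name and .dataType.
    numeric_cols = [
        field.name
        for field in df.schema.fields
        if isinstance(field.dataType, numeric_types)
    ]
    return df.select(*numeric_cols)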

Selecting string columns

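For string columns, keep the fields whose type is StringType. As with numeric columns, the check runs against df.schema, because dtypes reports the type as the plain string "string" rather than as a StringType instance.

from pyspark.sql.types import StringType

def select_string_columns(df):
    # Collect the names of all fields whose data type is StringType.
    string_cols = [
        field.name
        for field in df.schema.fields
        if isinstance(field.dataType, StringType)
    ]
    return df.select(*string_cols)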
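
If only the column names are needed, rather than a new DataFrame, the type strings reported by df.dtypes can be filtered directly. The following is a minimal sketch under some assumptions: the helper name classify_column_names and the sample data are hypothetical, and the numeric set is broader than the four types listed earlier, since it also counts tinyint, smallint, and decimal columns as numeric. Trim the set if that is not what you want.

```python
# Type strings that df.dtypes reports for Spark's built-in numeric types.
NUMERIC_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def classify_column_names(dtypes):
    """Group column names by kind, given (name, type_string) pairs like df.dtypes."""
    groups = {"numeric": [], "string": [], "other": []}
    for name, type_name in dtypes:
        # Decimal columns carry precision and scale, e.g. "decimal(10,2)".
        if type_name in NUMERIC_TYPE_NAMES or type_name.startswith("decimal"):
            groups["numeric"].append(name)
        elif type_name == "string":
            groups["string"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Hypothetical sample shaped like the output of df.dtypes; with a real
# DataFrame, call classify_column_names(df.dtypes) instead.
sample_dtypes = [
    ("id", "bigint"),
    ("name", "string"),
    ("price", "decimal(10,2)"),
    ("score", "double"),
    ("created", "timestamp"),
]
groups = classify_column_names(sample_dtypes)
print(groups["numeric"])
print(groups["string"])
```

Because the helper only reads plain tuples, it can be exercised without a running Spark session, and the resulting name lists can be passed straight to df.select(*names).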

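With the two functions above, you can keep only the numeric columns or only the string columns of a DataFrame for later processing. If you need just the names, read the columns attribute of the returned DataFrame.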