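Select Only Numeric or String Column Names from a PySpark DataFrame
In this article, we will discuss how to select only the numeric or string column names from a Spark DataFrame.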
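Methods used:
- createDataFrame: creates a Spark DataFrame.
- isinstance: a built-in Python function that checks whether a given object is an instance of a given type.
- dtypes: returns a list of tuples (columnName, type). The list contains every column in the DataFrame together with its data type.
- schema.fields: gives access to the DataFrame's field metadata.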
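Method #1:
In this method, the dtypes attribute is used to get the list of tuples (columnName, type). The type name of each column is then compared with the required types.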
Python3
from pyspark.sql import Row
from datetime import date
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Creating dataframe from list of Row
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])
# Printing DataFrame structure
print("DataFrame structure:", df)
# Getting list of columns and printing
# result
dt = df.dtypes
print("dtypes result:", dt)
# Getting list of columns having type
# string or bigint
# This statement will loop over all the
# tuples present in dt list; each tuple
# is (column name, column type)
columnList = [name for name, dtype in dt
              if dtype.startswith(('string', 'bigint'))]
print("Result: ", columnList)
Output:
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
dtypes result: [('a', 'bigint'), ('b', 'string'), ('c', 'date')]
Result: ['a', 'b']
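The example above matches only bigint among the numeric types. Because dtypes returns a plain list of (name, type) tuples, the filter can be extended to the other numeric type names and tried without a Spark session. The following is a minimal sketch under stated assumptions: columns_by_kind and NUMERIC_TYPES are hypothetical names introduced here for illustration, and the type names are the short strings Spark reports in dtypes (tinyint, smallint, int, bigint, float, double, decimal(p,s)).

```python
# Hypothetical helper, not part of PySpark: filters a dtypes-style list.
# Exact matches are used for the simple numeric names so that types such
# as 'interval' are not mistaken for 'int'.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def is_numeric(dtype):
    """True for numeric type names; decimal carries a (p,s) suffix."""
    return dtype in NUMERIC_TYPES or dtype.startswith("decimal")


def columns_by_kind(dtypes, include_string=True):
    """Return names of numeric (and optionally string) columns."""
    return [
        name for name, dtype in dtypes
        if is_numeric(dtype) or (include_string and dtype == "string")
    ]


# Same shape as the dtypes result printed above
dt = [("a", "bigint"), ("b", "string"), ("c", "date")]
print(columns_by_kind(dt))                        # ['a', 'b']
print(columns_by_kind(dt, include_string=False))  # ['a']
```

With a real DataFrame, the call would be columns_by_kind(df.dtypes).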
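Method #2:
In this method, schema.fields is used to get the field metadata. The data type of each column is then extracted from the metadata and compared with the required data types.
Python3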
from pyspark.sql.types import StringType, LongType
from pyspark.sql import Row
from datetime import date
from pyspark.sql import SparkSession
# Initializing spark session
spark = SparkSession.builder.getOrCreate()
# Creating dataframe from list of Row
df = spark.createDataFrame([
    Row(a=1, b='string1', c=date(2021, 1, 1)),
    Row(a=2, b='string2', c=date(2021, 2, 1)),
    Row(a=4, b='string3', c=date(2021, 3, 1))
])
# Printing DataFrame structure
print("DataFrame structure:", df)
# Getting and printing metadata
meta = df.schema.fields
print("Metadata: ", meta)
# Getting list of columns having type
# string or long (bigint)
# This statement will loop over all the fields
# field.name will return column name and
# field.dataType will return column type
columnList = [field.name for field in df.schema.fields
              if isinstance(field.dataType, (StringType, LongType))]
print("Result: ", columnList)
Output:
DataFrame structure: DataFrame[a: bigint, b: string, c: date]
Metadata: [StructField(a,LongType,true), StructField(b,StringType,true), StructField(c,DateType,true)]
Result: ['a', 'b']