PySpark – Select Columns by Type
In this article, we will discuss how to select columns by type in PySpark using Python.
Let's create a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType,
                               IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
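With this data, dataframe.show() prints a table along these lines (exact padding may vary):

+--+------+---+------+
|ID|  NAME|GPA|   FEE|
+--+------+---+------+
| 1|sravan|9.8|4500.0|
| 2|ojsawi|9.2|6789.0|
| 3| bobby|8.9| 988.0|
+--+------+---+------+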
We can select columns of a given type using the following type keywords:
- Integer: int
- String: string
- Float: float
- Double: double
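These keywords are exactly the type strings reported by the dtypes attribute; a quick check on the dataframe built above:

Python3
# dtypes returns a list of (column name, type string) pairs
print(dataframe.dtypes)
# expected: [('ID', 'int'), ('NAME', 'string'), ('GPA', 'float'), ('FEE', 'double')]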
Method 1: Using dtypes
Here, we use the dtypes attribute followed by the startswith() string method to get the columns of a specific type.
Syntax: dataframe[[item[0] for item in dataframe.dtypes if item[1].startswith('datatype')]]
where,
- dataframe is the input dataframe
- datatype is one of the type keywords listed above
- item is a (column name, type string) pair yielded by dataframe.dtypes
Finally, we use the collect() method to display the column data.
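Note that indexing the dataframe with a list of column names, as in the syntax above, is shorthand for select(); the integer case could equivalently be written as below (a sketch, assuming the dataframe created earlier):

Python3
# collect the names of the integer columns first
int_cols = [name for name, dtype in dataframe.dtypes
            if dtype.startswith('int')]

# select() with a list of names is equivalent to dataframe[[...]]
print(dataframe.select(int_cols).collect())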
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType,
                               DoubleType, IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('int')]].collect())

# select columns that are string type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('string')]].collect())

# select columns that are float type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('float')]].collect())

# select columns that are double type
print(dataframe[[item[0] for item in dataframe.dtypes
                 if item[1].startswith('double')]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
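The pattern can also be wrapped in a small reusable function; a minimal sketch, where select_by_dtype is a hypothetical helper name rather than part of the PySpark API:

Python3
# hypothetical helper, not part of PySpark itself
def select_by_dtype(df, prefix):
    # keep only the columns whose dtype string starts with
    # the given prefix ('int', 'string', 'float' or 'double')
    names = [name for name, dtype in df.dtypes
             if dtype.startswith(prefix)]
    return df.select(names)

# for example: select_by_dtype(dataframe, 'float').collect()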
Method 2: Using schema.fields
Here, we use the schema.fields attribute to get each column's type, and check for a specific type using the type classes available in the pyspark.sql.types module.
Let's check them one by one:
- Integer – IntegerType
- Float – FloatType
- Double – DoubleType
- String – StringType
We use the built-in isinstance() function to check these data types.
Syntax: dataframe[[f.name for f in dataframe.schema.fields if isinstance(f.dataType, datatype)]]
where,
- dataframe is the input dataframe
- f.name is the column name of each StructField
- datatype is one of the type classes listed above
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType,
                               IntegerType, StructType,
                               StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, IntegerType)]].collect())

# select columns that are string type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, StringType)]].collect())

# select columns that are float type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, FloatType)]].collect())

# select columns that are double type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, DoubleType)]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
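Since isinstance() also matches base classes, this method can pick out several related types at once; a minimal sketch, where select_by_type is a hypothetical helper name and NumericType is the common base class of the numeric Spark types:

Python3
# NumericType is the base class of IntegerType, FloatType,
# DoubleType and the other numeric types
from pyspark.sql.types import NumericType

# hypothetical helper, not part of PySpark itself
def select_by_type(df, type_class):
    # keep only the columns whose dataType is an
    # instance of the given pyspark.sql.types class
    names = [f.name for f in df.schema.fields
             if isinstance(f.dataType, type_class)]
    return df.select(names)

# for example, passing NumericType selects the ID, GPA
# and FEE columns in one call:
# select_by_type(dataframe, NumericType).show()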