How to verify PySpark DataFrame column types?

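A large DataFrame can contain any number of columns with different data types. Before preprocessing the data or applying operations to it, we need to know the dimensions of the DataFrame and the data type of each of its columns.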

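In this article, we will see how to verify the column types of a DataFrame. To do so, we use dtypes, which returns a list of tuples containing each column's name and type. Note that dtypes is an attribute (property) of the DataFrame, not a method, so it is written without parentheses.

Syntax: df.dtypes

where df is the DataFrame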

First, we will create a DataFrame, and then look at some examples and their implementation.

Python

# importing the necessary library
from pyspark.sql import SparkSession


# function to create (or reuse) a SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Product_details.com") \
        .getOrCreate()
    return spk


# function to create a DataFrame from rows and a list of column names
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1


if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    # each tuple is one row: (Name, ID, Rating, Price)
    input_data = [("Mobile", 112345, 4.0, 12499),
                  ("LED TV", 114567, 4.2, 49999),
                  ("Refrigerator", 123543, 4.4, 13899),
                  ("Washing Machine", 113465, 3.9, 6999),
                  ("T-shirt", 124378, 4.1, 1999),
                  ("Jeans", 126754, 3.7, 3999),
                  ("Running Shoes", 134565, 4.7, 1499),
                  ("Face Mask", 145234, 4.6, 999)]

    # only column names are given, so Spark infers the column types
    schema = ["Name", "ID", "Rating", "Price"]

    # calling function to create the DataFrame
    df = create_df(spark, input_data, schema)

    # displaying the DataFrame
    df.show()

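Output:

Example 1: Verify the column types of the DataFrame using dtypes

In the example code below, after creating the DataFrame, we get the types of all its columns with df.dtypes, embedded in an f-string so that the result is printed directly. This gives a list of tuples, each containing a column's name and data type.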

Python

# finding the data types of all the
# columns using the dtypes attribute
# and printing them
print(f'Data types of all the columns are : {df.dtypes}')

# displaying the DataFrame
df.show()

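Output:

Example 2: Verify the data type of a specific column of the DataFrame

In the code below, after creating the DataFrame, we find the data type of a particular column by writing dict(df.dtypes)['Rating']. As we saw in the previous example, df.dtypes returns a list of tuples containing each column's name and data type, so passing it to dict converts that list of tuples into a dictionary.

A dictionary stores data as key-value pairs. By writing dict(df.dtypes)['Rating'] we supply the key 'Rating' and get back its value, double, which is the data type of that column. In this way, we can find the data type of a column by passing its name.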

Python

# finding the data type of the Rating
# column using the dtypes attribute
data_type = dict(df.dtypes)['Rating']

# printing
print(f'Data type of Rating is : {data_type}')

# displaying the DataFrame
df.show()

Output:
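
The dictionary lookup used in Example 2 can be extended into a check of several columns at once. The sketch below is only an illustration: find_type_mismatches is a hypothetical helper (not part of PySpark), and sample_dtypes is a hand-written stand-in for the list that df.dtypes would return for the DataFrame created earlier, assuming Spark infers string, bigint and double from the Python str, int and float values. With a live session, df.dtypes would be passed in instead.

```python
def find_type_mismatches(dtypes, expected):
    """Compare (name, type) pairs, shaped like df.dtypes, with expected types.

    Returns {column: (expected_type, actual_type)} for every column whose
    type differs; actual_type is None when the column is missing.
    """
    actual = dict(dtypes)  # same list-of-tuples to dict conversion as in Example 2
    mismatches = {}
    for name, wanted in expected.items():
        found = actual.get(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Assumed stand-in for df.dtypes; replace with df.dtypes in a real session.
sample_dtypes = [("Name", "string"), ("ID", "bigint"),
                 ("Rating", "double"), ("Price", "bigint")]

# Hypothetical expectations; "Price" is deliberately wrong to show a mismatch.
expected = {"Name": "string", "Rating": "double", "Price": "int"}

print(find_type_mismatches(sample_dtypes, expected))
# prints {'Price': ('int', 'bigint')}
```

An empty dictionary means every listed column has the expected type, so the result can be used directly in a condition or an assert.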

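Example 3: Verify the column types of the DataFrame using a for loop

After creating the DataFrame, we use df.dtypes to get the column names together with their data types as a list of tuples.

While iterating over this list, each item is a tuple of column name and column type, which we print with print(col[0],",",col[1]). In this way, we get every column name and column type by iteration.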

Python

print("Data types of the columns with column names are:")

# finding the data type of every column,
# together with its name, using a for loop
for col in df.dtypes:

    # printing the column name and the
    # data type of that column
    print(col[0],",",col[1])

# displaying the DataFrame
df.show()

Output:
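
The loop from Example 3 can also collect column names instead of printing them, for instance to pick out the numeric columns before applying arithmetic to them. This is a minimal sketch, not a PySpark API: columns_of_type is a hypothetical helper, sample_dtypes is an assumed stand-in for what df.dtypes would return for the DataFrame created earlier, and the set of numeric type names is illustrative rather than exhaustive.

```python
def columns_of_type(dtypes, wanted_types):
    """Return the names of columns whose type string is in wanted_types."""
    selected = []
    # each item is a (column name, column type) tuple, as in df.dtypes
    for name, type_name in dtypes:
        if type_name in wanted_types:
            selected.append(name)
    return selected


# Assumed stand-in for df.dtypes; replace with df.dtypes in a real session.
sample_dtypes = [("Name", "string"), ("ID", "bigint"),
                 ("Rating", "double"), ("Price", "bigint")]

# Illustrative subset of Spark's numeric type names.
numeric_types = {"int", "bigint", "float", "double"}

print(columns_of_type(sample_dtypes, numeric_types))
# prints ['ID', 'Rating', 'Price']
```

Unpacking each tuple into name and type_name does the same job as col[0] and col[1] in the loop above, with more readable variable names.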

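Example 4: Verify the column types of the DataFrame using the schema

After creating the DataFrame, we call the printSchema() function by writing df.printSchema(). It prints the schema of the DataFrame, which includes the data type of every column. So printSchema() is another easy way to verify the column types of a PySpark DataFrame.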

Python

# printing the schema of the DataFrame
# using the printSchema() function
df.printSchema()

# displaying the DataFrame
df.show()

Output: