PySpark 数据框基于其他列添加列

在本文中，我们将了解如何将基于另一列的列添加到 Pyspark Dataframe。

创建用于演示的数据框：

在这里，我们将从给定数据集的列表中创建一个数据框。

Python3

# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()
  
# Create a spark dataframe
columns = ["Name", "Course_Name",
           "Months",
           "Course_Fees", "Discount",
           "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3, 10000, 1000,
     "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills", 2,
     8000, 800, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6,
     15000, 1500, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12,
     60000, 900, "02-12-2021", False),
]
  
df = spark.createDataFrame(data).toDF(*columns)
  
# View the dataframe
df.show()

Python3

new_df = df.withColumn('After_discount',
                       df.Course_Fees - df.Discount)
new_df.show()

Python3

df.registerTempTable('table')
newDF = spark.sql('select *, Course_Fees - Discount as Total from table')
newDF.show()

Python3

# import the functions as F from pyspark.sql
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
  
# define the sum_col
def Total(Course_Fees, Discount):
    res = Course_Fees - Discount
    return res
  
# integer datatype is defined
new_f = F.udf(Total, IntegerType())
  
# calling and creating the new
# col as udf_method_sum
new_df = df.withColumn(
  "Total_price", new_f("Course_Fees", "Discount"))
  
# Showing the Dataframe
new_df.show()

输出：

方法 1：使用 withColumns()

它用于更改值、转换现有列的数据类型、创建新列等等。

Syntax: df.withColumn(colName, col)

Returns: A new :class:`DataFrame` by adding a column or replacing the existing column that has the same name.

编程需要懂一点英语

蟒蛇3

new_df = df.withColumn('After_discount',
                       df.Course_Fees - df.Discount)
new_df.show()

输出：

方法二：使用SQL查询

这里我们将在 Pyspark 中使用 SQL 查询，我们将在 createTempView() 的帮助下创建表的临时视图，并且该临时表的生命周期取决于 sparkSession 的生命周期。 registerTempTable() 将创建临时表，如果它不可用，或者如果它可用则替换它。

然后在创建表后通过 SQL 子句选择表，它将所有值作为一个字符串。

蟒蛇3

df.registerTempTable('table')
newDF = spark.sql('select *, Course_Fees - Discount as Total from table')
newDF.show()

输出：

方法 3：使用 UDF

在此方法中，我们将定义用户定义的函数，该函数将接受两个参数并返回总价。这个函数允许我们根据我们的要求创建一个新函数。

现在我们定义 UDF函数的数据类型并创建将返回值的函数，该值是行中所有值的总和。

蟒蛇3

# import the functions as F from pyspark.sql
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
  
# define the sum_col
def Total(Course_Fees, Discount):
    res = Course_Fees - Discount
    return res
  
# integer datatype is defined
new_f = F.udf(Total, IntegerType())
  
# calling and creating the new
# col as udf_method_sum
new_df = df.withColumn(
  "Total_price", new_f("Course_Fees", "Discount"))
  
# Showing the Dataframe
new_df.show()

输出：