📅  Last modified: 2023-12-03 15:00:01.009000             🧑  Author: Mango
In linear regression problems, the sum of squared errors (SSE) between predicted and actual values measures how well a fitted model matches the data. Note that, despite what some references suggest, computeCost is not a built-in method of PySpark's LinearRegressionModel class — in PySpark that method belongs to clustering models such as KMeansModel (and has been deprecated since Spark 3.0). For linear regression, the SSE is computed from the model's predictions, as shown below.
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
# Creating SparkSession
spark = SparkSession.builder.appName('Linear_Regression').getOrCreate()
# Creating Spark Dataframe
df = spark.read.csv('path/to/csv/file', header=True, inferSchema=True)
# Creating vector assembler
assembler = VectorAssembler(inputCols=['input_col_1', 'input_col_2'], outputCol='features')
df = assembler.transform(df)
# Splitting the Dataset
train_data, test_data = df.randomSplit([0.7, 0.3])
# Creating Linear Regression Model
lr = LinearRegression(featuresCol='features', labelCol='target_col')
# Fitting the Model
lr_model = lr.fit(train_data)
# Calculating the Sum of Squared Errors
test_results = lr_model.transform(test_data)
test_results.select('prediction', 'target_col').show()
from pyspark.ml.evaluation import RegressionEvaluator
# Calculating RMSE
evaluator = RegressionEvaluator(labelCol='target_col', predictionCol='prediction', metricName='rmse')
rmse = evaluator.evaluate(test_results)
print('RMSE:', rmse)
# Calculating the Sum of Squared Errors on the test set
from pyspark.sql import functions as F
sse = test_results.select(
    F.sum((F.col('target_col') - F.col('prediction')) ** 2).alias('sse')
).first()['sse']
print('Sum of Squared Errors:', sse)
In this example, we first create a SparkSession and load a CSV file into a Spark dataframe. Next, we create a vector assembler to combine our input features into a single column. We then split the dataframe into training and testing data, and use the training portion to fit our linear regression model.
After fitting the model, we calculate the root mean squared error (RMSE) using a RegressionEvaluator object. We then compute the sum of squared errors by squaring and summing the residuals (target_col minus prediction) over the test predictions; LinearRegressionModel's summary object has no totalCost property, although related training metrics such as meanSquaredError and rootMeanSquaredError are available on it.
Finally, we print both the RMSE and the sum of squared errors to the console.
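Because both metrics come from the same squared residuals, RMSE and SSE are linked by RMSE = sqrt(SSE / n), where n is the number of test rows. A quick plain-Python sketch of that relationship (no Spark session required; the actual and predicted values here are made up for illustration):

```python
import math

# Hypothetical actual and predicted values, for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.0, 9.5]

n = len(actual)
# Sum of squared errors: sum of squared residuals
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Root mean squared error derived from the SSE
rmse = math.sqrt(sse / n)

print('SSE:', sse)    # 0.25 + 0.25 + 1.0 + 0.25 = 1.75
print('RMSE:', rmse)  # sqrt(1.75 / 4) ≈ 0.6614
```

This is the same arithmetic that RegressionEvaluator performs internally for the 'rmse' metric, just written out by hand.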
The sum of squared errors is a simple but powerful metric for evaluating the performance of linear regression models in PySpark. By calculating it, we can determine how well our model fits the data and make adjustments to improve its accuracy.
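To turn the raw SSE into a scale-free measure of fit, it is commonly normalized into the coefficient of determination, R² = 1 − SSE/SST, where SST is the total sum of squares of the labels around their mean. PySpark exposes this same value as the r2 property of the regression summary and as the 'r2' metric of RegressionEvaluator. A plain-Python sketch with made-up values:

```python
# Hypothetical actual and predicted values, for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.0, 9.5]

mean_y = sum(actual) / len(actual)                            # mean of the labels
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))    # residual sum of squares
sst = sum((a - mean_y) ** 2 for a in actual)                  # total sum of squares
r2 = 1 - sse / sst

print('R^2:', r2)  # 1 - 1.75 / 20 = 0.9125
```

An R² close to 1 means the model explains most of the variance in the labels; here the small SSE relative to SST indicates a good fit on this toy data.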