📅  Last modified: 2023-12-03 14:44:53.498000             🧑  Author: Mango
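OneHotEncoder is a feature transformer provided by PySpark. It maps a column of numeric category indices to a column of binary vectors, each containing at most a single 1. With the default dropLast=True, the vector length is the number of categories minus one; with dropLast=False, it equals the number of categories.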
Here is an example of using OneHotEncoder in PySpark:
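from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.sql import SparkSession
# Reuse the active session (for example, in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (0, "male"),
    (1, "female"),
    (2, "other"),
], ["id", "gender"])
# OneHotEncoder expects numeric category indices, so index the strings first.
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
indexed = indexer.fit(df).transform(df)
# Encode the index column into a sparse binary vector column.
encoder = OneHotEncoder(inputCols=["gender_index"], outputCols=["gender_vector"])
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
This code creates a DataFrame with two columns, "id" and "gender". Because OneHotEncoder operates on numeric category indices rather than raw strings, a StringIndexer first maps each gender value to an index in a new "gender_index" column. The OneHotEncoder is then created with "gender_index" as the input column and "gender_vector" as the output column. Finally, the encoder is fitted to the indexed DataFrame and used to transform it, and the resulting DataFrame is shown.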
OneHotEncoder has several parameters that can be used to configure its behavior:
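inputCols
: A list of input column names. Each column should contain numeric category indices (for example, as produced by StringIndexer), not raw strings. In Spark 3.0 and later, the single-column forms inputCol and outputCol are also available.
outputCols
: A list of output column names, one for each input column.
dropLast
: Whether to drop the last category from the encoded vector, so that the last category is represented by an all-zeros vector. Defaults to True. This is useful for avoiding collinearity.
handleInvalid
: How to handle invalid input data during transform. Options are "error" (the default, which raises an error) and "keep" (which assigns invalid values to an extra category).
OneHotEncoder is a useful feature transformer for encoding categorical data in PySpark. By representing each category as a binary vector, it allows machine learning algorithms to consume categorical data as numeric features.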
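To make the effect of dropLast and handleInvalid on the encoded vector more concrete, here is a minimal plain-Python sketch of the encoding rule. It is an illustration only, not Spark's implementation: the function name one_hot and its arguments are hypothetical, it returns a dense list rather than Spark's sparse vector, and it assumes that any index at or above the number of categories seen during fitting counts as invalid.

```python
from typing import List


def one_hot(index: int, num_categories: int,
            drop_last: bool = True,
            handle_invalid: str = "error") -> List[float]:
    """Illustrative one-hot rule for a single category index (hypothetical helper)."""
    if handle_invalid not in ("error", "keep"):
        raise ValueError("handle_invalid must be 'error' or 'keep'")
    if index < 0:
        raise ValueError("category index must be non-negative")

    # With "keep", one extra slot is reserved for invalid (unseen) values.
    size = num_categories + 1 if handle_invalid == "keep" else num_categories

    if index >= num_categories:
        if handle_invalid == "error":
            raise ValueError(f"unseen category index: {index}")
        index = num_categories  # route to the extra "invalid" slot

    if drop_last:
        size -= 1  # the last slot is dropped and becomes the all-zeros vector

    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(0, 3))                   # [1.0, 0.0]
print(one_hot(2, 3))                   # [0.0, 0.0]  (last category, dropped)
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
print(one_hot(5, 3, drop_last=False, handle_invalid="keep"))  # [0.0, 0.0, 0.0, 1.0]
```

In this sketch, combining handle_invalid="keep" with drop_last=True makes the extra "invalid" slot the one that is dropped, so unseen values come out as an all-zeros vector while every known category keeps its own position.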