OneHotEncoder pyspark - Python

Last modified: 2023-12-03 14:44:53.498000             Author: Mango

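OneHotEncoder is a feature transformer provided by PySpark in pyspark.ml.feature. It converts a column of numeric category indices into a column of binary vectors in which at most one entry is 1. The vector length equals the number of categories, or one less when the last category is dropped (dropLast, which is enabled by default).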
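To make the mapping concrete, here is a minimal pure-Python sketch of what a one-hot encoding of category indices looks like, with and without dropping the last category. The helper one_hot is hypothetical and written only for illustration; it is not part of PySpark, and PySpark itself returns sparse vectors rather than lists.

```python
def one_hot(index, num_categories, drop_last=True):
    """Hypothetical helper: one-hot encode a category index as a plain list."""
    # Dropping the last category shortens the vector by one entry.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # The dropped last category has no slot, so it stays all zeros.
    if index < size:
        vector[index] = 1.0
    return vector


# Three categories with indices 0, 1 and 2.
full = [one_hot(i, 3, drop_last=False) for i in range(3)]
dropped = [one_hot(i, 3) for i in range(3)]

print(full)     # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(dropped)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

With the last category dropped, the all-zeros vector still identifies that category unambiguously, which is why no information is lost.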

How to use OneHotEncoder in PySpark

Here is an example of using OneHotEncoder in PySpark:

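from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.sql import SparkSession

# Reuse the active session or start a new one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, "male"),
    (1, "female"),
    (2, "other"),
], ["id", "gender"])

# OneHotEncoder expects numeric category indices, so index the strings first
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
indexed = indexer.fit(df).transform(df)

encoder = OneHotEncoder(inputCols=["gender_index"], outputCols=["gender_vector"])
encoded = encoder.fit(indexed).transform(indexed)

encoded.show()

This code creates a DataFrame with two columns: "id" and "gender". Because OneHotEncoder operates on numeric category indices rather than strings, a StringIndexer first maps each gender value to an index in a new "gender_index" column. The code then creates an instance of the OneHotEncoder with "gender_index" as the input column and "gender_vector" as the output column. Finally, it fits and transforms the indexed DataFrame with the encoder, and shows the resulting DataFrame, in which each "gender_vector" value is a sparse vector.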

Parameters of OneHotEncoder

OneHotEncoder has several parameters that can be used to configure its behavior:

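  • inputCols: A list of input column names to encode. Each column should hold numeric category indices (for example, the output of StringIndexer), not raw strings.
  • outputCols: A list of output column names, one for each input column.
  • dropLast: Whether to drop the last category in the encoded vector (default: True). The dropped category is then represented by an all-zeros vector. This is useful to avoid collinearity, because the entries would otherwise always sum to one.
  • handleInvalid: How to handle invalid input data during transform. Options are "error" (the default, which raises an error) and "keep" (which maps invalid values to an extra category).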

Conclusion

OneHotEncoder is a feature transformer for encoding categorical data in PySpark. By turning each category index into a binary vector with at most one nonzero entry, it lets machine learning algorithms that expect numeric features work with categorical data.