📅  Last modified: 2023-12-03 15:17:58             🧑  Author: Mango
When working with data in pandas, we often pre-process the data to make it fit for analysis. One such pre-processing step is normalization. Normalization is the process of scaling the values of a feature so that they fall within a specific range, such as [0, 1]. This keeps features measured on a large scale from dominating features measured on a small scale.
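In pandas, normalize is a parameter of counting functions such as Series.value_counts, DataFrame.value_counts, and pandas.crosstab. If set to True, each count is divided by the total, so the results lie in the interval [0, 1] and represent relative frequencies rather than raw counts. It does not rescale arbitrary numeric columns; that kind of scaling is done with ordinary arithmetic on the columns or rows.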
The syntax for using the normalize parameter, shown here for Series.value_counts (DataFrame.sum does not accept it), is as follows:
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
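As a minimal sketch of normalize=True in practice, the example below computes relative frequencies for a categorical column. The city labels mirror the sample data used in this article; the variable names are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: the same city labels used in this article's example
cities = pd.Series(['New York', 'New York', 'Boston', 'Boston', 'Atlanta', 'Atlanta'])

# Absolute counts per city
counts = cities.value_counts()

# normalize=True divides each count by the total number of rows,
# so the result holds proportions that sum to 1
shares = cities.value_counts(normalize=True)

print(counts)
print(shares)
```

Each city appears twice out of six rows, so every proportion here is one third.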
import pandas as pd
data = {'city': ['New York', 'New York', 'Boston', 'Boston', 'Atlanta', 'Atlanta'],
'temperature': [25, 28, 21, 25, 30, 35],
'humidity': [60, 65, 70, 75, 80, 85]}
df = pd.DataFrame(data)
print("The dataframe before normalization:")
print(df)
# Min-max scale the temperature and humidity columns to [0, 1], column by column
numeric = df[['temperature', 'humidity']]
normalized_df = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print("\nThe dataframe after normalization (using manual calculation):")
print(normalized_df)
# Row-wise shares: divide each value by its row sum (no normalize=True call is involved here)
normalized_df = df[['temperature', 'humidity']].div(df[['temperature', 'humidity']].sum(axis=1), axis=0)
normalized_df['city'] = df['city']
print("\nThe dataframe after row-wise normalization (each row sums to 1):")
print(normalized_df)
Output:
The dataframe before normalization:
       city  temperature  humidity
0  New York           25        60
1  New York           28        65
2    Boston           21        70
3    Boston           25        75
4   Atlanta           30        80
5   Atlanta           35        85

The dataframe after normalization (using manual calculation):
   temperature  humidity
0     0.285714       0.0
1     0.500000       0.2
2     0.000000       0.4
3     0.285714       0.6
4     0.642857       0.8
5     1.000000       1.0

The dataframe after row-wise normalization (each row sums to 1):
   temperature  humidity      city
0     0.294118  0.705882  New York
1     0.301075  0.698925  New York
2     0.230769  0.769231    Boston
3     0.250000  0.750000    Boston
4     0.272727  0.727273   Atlanta
5     0.291667  0.708333   Atlanta
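The two approaches in the example scale along different axes, which is easy to confuse. As a rough sanity check, the sketch below rebuilds both results from the same numeric columns and verifies the property each one should have. The names min_max and row_share are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temperature': [25, 28, 21, 25, 30, 35],
    'humidity': [60, 65, 70, 75, 80, 85],
})

# Column-wise min-max scaling: each column is mapped onto [0, 1]
min_max = (df - df.min()) / (df.max() - df.min())

# Row-wise shares: each value divided by its row total
row_share = df.div(df.sum(axis=1), axis=0)

# Every min-max column should have minimum 0 and maximum 1
print(min_max.min().tolist(), min_max.max().tolist())

# Every row of shares should add up to 1
print(np.allclose(row_share.sum(axis=1), 1.0))
```

Min-max scaling compares values within a column, while row shares compare values within a single record, so the two results are not interchangeable.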
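Normalizing data is a common pre-processing task when performing data analysis. pandas provides several ways to normalize data. For numeric features, min-max scaling or row-wise shares can be computed with ordinary arithmetic, as in the example above. The normalize parameter of functions such as value_counts and crosstab covers a different case: by setting it to True, we convert counts into proportions in the interval [0, 1], giving us relative frequencies.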
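For row-wise relative frequencies specifically, pandas.crosstab accepts normalize='index'. The sketch below assumes a hypothetical weather label per record, which is not part of the article's dataset and is included only for illustration.

```python
import pandas as pd

# Hypothetical observations: a city and a weather label per record
records = pd.DataFrame({
    'city': ['New York', 'New York', 'Boston', 'Boston', 'Atlanta', 'Atlanta'],
    'weather': ['mild', 'warm', 'mild', 'mild', 'warm', 'warm'],
})

# normalize='index' divides each row of counts by that row's total,
# so every row of the resulting table sums to 1
row_freq = pd.crosstab(records['city'], records['weather'], normalize='index')
print(row_freq)
```

Passing normalize='columns' would instead divide by column totals, and normalize=True (or 'all') divides by the grand total.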