Python Pandas: Categorical Data (1)

📅 Last modified: 2023-12-03 15:34:03.159000 🧑 Author: Mango

Python Pandas: Categorical Data

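Pandas is a widely used data-processing library that provides a dedicated data type for categorical data. This type lets you store and compute on data with many repeated values efficiently.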

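Why use a categorical data type?

Categorical data shows up constantly in data analysis: gender, region, industry, and so on. Such a column takes only a handful of distinct values, each repeated many times. Analyses often count these values, or group by them to compute the mean, maximum, or minimum of another column. Stored as ordinary strings, the column takes more memory, and comparisons and grouping work on full strings rather than small integers, which is noticeably slower on large datasets.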

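By comparison, the categorical data type has these advantages:

Storage is far more compact: each distinct value is stored once in a list of categories, and every element is stored as a small integer code pointing into that list, instead of as a repeated string or number.
Grouping and sorting are usually faster than with the plain types above, because these operations work on the underlying integer codes rather than on the original values.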
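The storage advantage is easy to measure. The following is a minimal sketch, assuming a made-up `regions` column with two distinct values; the exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

# Hypothetical data: two distinct values repeated many times.
regions = pd.Series(["north", "south"] * 50_000)
regions_cat = regions.astype("category")

# deep=True also counts the memory held by the string values themselves.
plain_bytes = regions.memory_usage(deep=True)
category_bytes = regions_cat.memory_usage(deep=True)

print("plain strings:", plain_bytes, "bytes")
print("category     :", category_bytes, "bytes")

# With only two categories, the codes fit in the smallest integer type.
print(regions_cat.cat.codes.dtype)
```

The saving shrinks as the number of distinct values approaches the number of rows, so the categorical type pays off mainly for columns with few distinct values.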

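Now let's look at how the categorical data type is used.

Creating categorical data

You can create categorical data in Pandas with pd.Categorical(). Passing an ordinary list or array to pd.Categorical() converts it to categorical data: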

import pandas as pd
import numpy as np

# Create a list
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

# Convert it to categorical data
fruits_cat = pd.Categorical(fruits)
print(fruits_cat)

Output:

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

In the code above, pd.Categorical() converts the list fruits to categorical data. The first line of the output shows the converted values, and Categories (2, object) on the second line means that fruits_cat contains 2 distinct categories, apple and orange.
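By default the categories are inferred from the data and sorted. pd.Categorical() also accepts the `categories` and `ordered` parameters, which let you fix the category list and its order yourself. A small sketch, assuming a hypothetical `sizes` list:

```python
import pandas as pd

# Hypothetical data with a natural order that differs from alphabetical order.
sizes = ["small", "large", "medium", "small"]

# Declare the categories explicitly and mark them as ordered.
sizes_cat = pd.Categorical(
    sizes,
    categories=["small", "medium", "large"],
    ordered=True,
)

print(sizes_cat.categories.tolist())
# min() and max() follow the declared order, not the alphabetical one.
print(sizes_cat.min(), sizes_cat.max())
```

Declaring the order is what makes comparisons and min()/max() meaningful for values such as sizes or ratings.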

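Attributes of categorical data

The code above created a Categorical object, which can be used directly as a Series or as a column of a DataFrame.

Here are two of its attributes:

categories: returns all the categories in the data. For the code above, the categories are ['apple', 'orange'].
codes: returns the integer code of each element. In the code above, the category apple maps to 0 and the category orange maps to 1.

Let's see how these attributes are used: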

# Print the categories
print(fruits_cat.categories)

# Print the codes
print(fruits_cat.codes)

Output:

Index(['apple', 'orange'], dtype='object')
[0 1 0 0 0 1 0 0]
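Together, categories and codes fully describe the data: each code is a position in the categories index. The sketch below rebuilds the original values from the two attributes, and also goes the other way with pd.Categorical.from_codes(). It assumes the data has no missing values, since a missing value has the code -1 and is not a valid position.

```python
import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
fruits_cat = pd.Categorical(fruits)

# Look up each code in the categories index to recover the values.
rebuilt_values = fruits_cat.categories.take(fruits_cat.codes)
print(list(rebuilt_values) == fruits)

# Build a Categorical directly from codes and categories.
rebuilt_cat = pd.Categorical.from_codes(
    fruits_cat.codes, categories=fruits_cat.categories
)
print(list(rebuilt_cat) == fruits)
```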

Converting existing data to the categorical type

An existing Series or DataFrame column can be converted to the categorical type by calling astype() with the data type set to 'category'.

# Convert the list to a Series
fruits_series = pd.Series(fruits)

# Convert the Series to the categorical type
fruits_series_cat = fruits_series.astype('category')

# Print the categories and codes through the .cat accessor
print(fruits_series_cat.cat.categories)
print(fruits_series_cat.cat.codes)

Output:

Index(['apple', 'orange'], dtype='object')
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int8
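The same conversion applies to a DataFrame column. The following sketch assumes a hypothetical DataFrame with `fruit` and `price` columns, converts one column, and then groups by it. Passing `observed=True` to groupby() restricts the result to categories that actually occur in the data.

```python
import pandas as pd

# Hypothetical DataFrame; the column names and prices are made up.
df = pd.DataFrame({
    "fruit": ["apple", "orange", "apple", "apple"],
    "price": [1.0, 2.0, 1.5, 1.2],
})

# Convert a single column by assigning the converted Series back.
df["fruit"] = df["fruit"].astype("category")
print(df.dtypes)

# Group by the categorical column and average the prices.
mean_price = df.groupby("fruit", observed=True)["price"].mean()
print(mean_price)
```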

Converting categorical data back

To convert a categorical Series or DataFrame column back to its original data type, use astype() again with that type.

# Convert the categorical Series back to the object dtype
fruits_series_des = fruits_series_cat.astype('object')

# Print the restored Series
print(fruits_series_des)

Output:

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object
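A quick way to confirm that the round trip loses nothing is to compare the restored Series with the original. This is a small sketch; it reuses the original Series' own dtype as the target instead of hard-coding 'object', which is a choice made here for illustration.

```python
import pandas as pd

original = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# Convert to categorical, then back to whatever dtype the original had.
as_category = original.astype('category')
restored = as_category.astype(original.dtype)

print(restored.dtype == original.dtype)
print(restored.tolist() == original.tolist())
```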

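Missing values in categorical data

Categorical data in Pandas can contain missing values. A missing value is displayed as NaN; it is not one of the categories, and its code is -1.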

# Create a Series that contains missing values
fruits_with_nan = pd.Series(['apple', 'orange', np.nan, 'apple'] * 2)

# Convert it to the categorical type
fruits_with_nan_cat = fruits_with_nan.astype('category')

# Print the categories and codes
print(fruits_with_nan_cat.cat.categories)
print(fruits_with_nan_cat.cat.codes)

Output:

Index(['apple', 'orange'], dtype='object')
0    0
1    1
2   -1
3    0
4    0
5    1
6   -1
7    0
dtype: int8
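Because NaN is not a category, filling missing values with a new label requires adding that label to the categories first. A minimal sketch, where "unknown" is a placeholder label chosen for this example:

```python
import numpy as np
import pandas as pd

fruits_with_nan_cat = pd.Series(
    ['apple', 'orange', np.nan, 'apple'] * 2
).astype('category')

# Locate the missing values; their codes are all -1.
missing_mask = fruits_with_nan_cat.isna()
print(missing_mask.sum())

# "unknown" is a hypothetical placeholder: register it as a category,
# then use it as the fill value.
filled = fruits_with_nan_cat.cat.add_categories(["unknown"]).fillna("unknown")
print(filled.cat.categories.tolist())
print(filled.isna().sum())
```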

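Operations on categorical data

Categorical data supports most of the operations available on the equivalent string data, such as len(), the in operator (on the Categorical itself), and value_counts():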

# Create a Series with repeated elements
s = pd.Series(['apple', 'orange', 'apple', 'banana'] * 2)

# Convert it to the categorical type
s = s.astype('category')

# Count how many times each value occurs
print(s.value_counts())

Output:

apple     4
banana    2
orange    2
Name: count, dtype: int64
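One detail about the in operator is worth a sketch: applied to a Series it tests the index labels, not the values, so a membership test on the values has to go through the underlying Categorical, available as `s.array`.

```python
import pandas as pd

s = pd.Series(['apple', 'orange', 'apple', 'banana'] * 2).astype('category')

# Number of elements, not number of categories.
n_elements = len(s)

# Membership test on the values, via the underlying Categorical.
apple_in_values = 'apple' in s.array
grape_in_values = 'grape' in s.array

# On the Series itself, `in` checks the index labels (0..7 here).
apple_in_index = 'apple' in s

print(n_elements, apple_in_values, grape_in_values, apple_in_index)
```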

Categorical data also supports most array operations, such as slicing and vectorized operations.

# Get the first 3 elements
print(s[:3])

Output:

0     apple
1    orange
2     apple
dtype: category
Categories (3, object): ['apple', 'banana', 'orange']
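Note that the slice above still lists all 3 categories, even though banana no longer appears in it. The sketch below shows one way to drop unused categories with cat.remove_unused_categories(), along with a vectorized comparison against a scalar.

```python
import pandas as pd

s = pd.Series(['apple', 'orange', 'apple', 'banana'] * 2).astype('category')

# A slice keeps the full category list of the original Series.
head = s.iloc[:3]
print(head.cat.categories.tolist())

# Drop the categories that no longer occur in the slice.
trimmed = head.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())

# Vectorized comparison: one boolean per element.
is_apple = s == 'apple'
print(is_apple.sum())
```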

Code snippet: