pandas Python tutorial - Python (1)

📅 Last modified: 2023-12-03 14:45:02.794000 🧑 Author: Mango

pandas Python tutorial - Python

This tutorial introduces the basics of the pandas Python library, how to use it, and the scenarios it is suited to.

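What is pandas?

pandas is an open-source data analysis library built on NumPy. It is used in areas such as data mining, data analysis, data cleaning, and data visualization.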

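Features of pandas

Works with data in tabular form
Handles missing data quickly
Provides a range of tools for merging and grouping data
Supports time series operations
Has powerful I/O tools that can read many data formats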
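
As a rough illustration of three of the features listed above (I/O, grouping, and merging), the sketch below uses made-up data. The column names `city`, `sales`, and `region` and the in-memory CSV text are assumptions for this example only.

```python
import io

import pandas as pd

# Hypothetical CSV content, read from memory instead of a file on disk.
csv_text = "city,sales\nA,10\nB,20\nA,30\n"
sales = pd.read_csv(io.StringIO(csv_text))

# Grouping: total sales per city.
totals = sales.groupby("city")["sales"].sum()

# Merging: attach a hypothetical region lookup table via the shared "city" column.
regions = pd.DataFrame({"city": ["A", "B"], "region": ["north", "south"]})
merged = sales.merge(regions, on="city")

print(totals)
print(merged)
```

`read_csv` accepts a file path as well as a file-like object, so the same call pattern applies to a real file.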

Installing pandas

pandas can be installed with pip:

pip install pandas
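
One simple way to confirm that the installation worked is to import the library and print its version. The version string you see depends on your environment.

```python
import pandas as pd

# Prints the installed pandas version; the exact value varies by environment.
version = pd.__version__
print(version)
```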

Basic pandas operations

Importing pandas

import numpy as np  # np.nan and np.random are used in the examples below
import pandas as pd

Creating a Series

A Series can be created with the following statement:

s = pd.Series([1,3,5,np.nan,6,8])
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
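
The output above shows integers printed as floats because the list contains `np.nan`. The following sketch checks that, counts the missing entries, and shows a Series with custom labels; the `labeled` Series and its labels are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The NaN forces the whole Series to float64, which is why 1 prints as 1.0.
print(s.dtype)

# isna() marks missing entries; sum() skips them by default.
missing_count = s.isna().sum()
total = s.sum()
print(missing_count, total)

# A Series can also carry custom labels instead of the default 0..n-1 index.
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])
```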

Creating a DataFrame

A DataFrame can be created with the following statements:

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print(df)

Output (the values are random, so your numbers will differ):

                   A         B         C         D
2016-01-01 -0.325459 -1.255455  1.756833 -0.985661
2016-01-02  0.157541 -1.094029 -1.162399  1.365767
2016-01-03  0.073426  2.537908 -1.328294 -0.237003
2016-01-04 -1.583608 -0.538704 -0.882628 -0.154218
2016-01-05 -1.048464  1.697358  1.008330  0.859932
2016-01-06  1.527778  1.178987 -0.106862 -0.400041
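
Because `np.random.randn` gives different numbers on every run, a seeded generator can be used when reproducible output matters. The sketch below is one way to do that, and it also shows building a DataFrame from a dict; the seed value and the `name`/`score` columns are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)

# A seeded generator makes the "random" table identical on every run.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# A DataFrame can also be built from a dict that maps column names to values.
df2 = pd.DataFrame({"name": ["x", "y"], "score": [1.5, 2.5]})

print(df.shape)
print(df2)
```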

Viewing data

View the first few rows of a DataFrame

print(df.head())

Output:

                   A         B         C         D
2016-01-01 -0.325459 -1.255455  1.756833 -0.985661
2016-01-02  0.157541 -1.094029 -1.162399  1.365767
2016-01-03  0.073426  2.537908 -1.328294 -0.237003
2016-01-04 -1.583608 -0.538704 -0.882628 -0.154218
2016-01-05 -1.048464  1.697358  1.008330  0.859932

View the last few rows of a DataFrame

print(df.tail())

Output:

                   A         B         C         D
2016-01-02  0.157541 -1.094029 -1.162399  1.365767
2016-01-03  0.073426  2.537908 -1.328294 -0.237003
2016-01-04 -1.583608 -0.538704 -0.882628 -0.154218
2016-01-05 -1.048464  1.697358  1.008330  0.859932
2016-01-06  1.527778  1.178987 -0.106862 -0.400041
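
Both `head()` and `tail()` return five rows by default and accept an optional row count. The sketch below uses a small frame filled with consecutive integers (an assumption made so that the result is easy to verify) rather than random values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
# Consecutive integers 0..23 so each cell's value is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

first_two = df.head(2)    # first 2 rows
last_three = df.tail(3)   # last 3 rows

print(first_two)
print(last_three)
```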

View a DataFrame's index and columns

print(df.index)
print(df.columns)

Output:

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')

View a DataFrame's data types

print(df.dtypes)

Output:

A    float64
B    float64
C    float64
D    float64
dtype: object
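
The trailing `dtype: object` in the output describes the Series returned by `df.dtypes` itself, not the columns. A frame with mixed column types makes the per-column result easier to see; the `mixed` frame and its column names are invented for this sketch.

```python
import pandas as pd

# Hypothetical frame with one integer, one float, and one boolean column.
mixed = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5], "flag": [True, False]})

# Each column keeps its own dtype; dtypes returns them as a Series.
column_types = mixed.dtypes
print(column_types)
```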

Selecting data

Select a single column

print(df['A'])

Output:

2016-01-01   -0.325459
2016-01-02    0.157541
2016-01-03    0.073426
2016-01-04   -1.583608
2016-01-05   -1.048464
2016-01-06    1.527778
Freq: D, Name: A, dtype: float64
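
Indexing with a single column name returns a Series, as above, while indexing with a list of names returns a DataFrame. A minimal sketch, assuming a frame of consecutive integers in place of the random one:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]          # one name -> Series
sub = df[["A", "C"]]   # list of names -> DataFrame with just those columns

print(type(col).__name__)
print(sub.columns.tolist())
```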

Select several rows

print(df[0:3])

Output:

                   A         B         C         D
2016-01-01 -0.325459 -1.255455  1.756833 -0.985661
2016-01-02  0.157541 -1.094029 -1.162399  1.365767
2016-01-03  0.073426  2.537908 -1.328294 -0.237003

Select a region

print(df.loc['20160102':'20160104',['A','B']])

Output:

                   A         B
2016-01-02  0.157541 -1.094029
2016-01-03  0.073426  2.537908
2016-01-04 -1.583608 -0.538704
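
Note that the label slice above includes both endpoints (three rows for 02 through 04). Position-based slicing with `iloc` follows normal Python rules and excludes the end. The sketch below contrasts the two on a frame of consecutive integers, which is an assumption made to keep the values checkable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based: both '20160102' and '20160104' are included -> 3 rows.
by_label = df.loc["20160102":"20160104", ["A", "B"]]

# Position-based: the stop index is excluded -> rows 1 and 2 only.
by_position = df.iloc[1:3, 0:2]

print(by_label)
print(by_position)
```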

Select the value at a specific position

print(df.iloc[3,1])

Output:

-0.538704336731
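
The same cell can be reached by position or by label. The sketch below shows `iloc`, `loc`, and the scalar accessors `iat` and `at` returning the same value; it assumes a frame of consecutive integers so the expected number is known.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Row position 3, column position 1 is the cell 3 * 4 + 1 = 13.
by_position = df.iloc[3, 1]
by_label = df.loc[dates[3], "B"]

# iat / at are scalar-only accessors for a single cell.
scalar_position = df.iat[3, 1]
scalar_label = df.at[dates[3], "B"]

print(by_position, by_label, scalar_position, scalar_label)
```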

Data cleaning

Handling missing data

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
print(df1)
print(df1.dropna())
print(df1.fillna(value=2))

Output:

                   A         B         C         D    E
2016-01-01 -0.325459 -1.255455  1.756833 -0.985661  1.0
2016-01-02  0.157541 -1.094029 -1.162399  1.365767  1.0
2016-01-03  0.073426  2.537908 -1.328294 -0.237003  NaN
2016-01-04 -1.583608 -0.538704 -0.882628 -0.154218  NaN

                   A         B         C         D    E
2016-01-01 -0.325459 -1.255455  1.756833 -0.985661  1.0
2016-01-02  0.157541 -1.094029 -1.162399  1.365767  1.0

                   A         B         C         D    E
2016-01-01 -0.325459 -1.255455  1.756833 -0.985661  1.0
2016-01-02  0.157541 -1.094029 -1.162399  1.365767  1.0
2016-01-03  0.073426  2.537908 -1.328294 -0.237003  2.0
2016-01-04 -1.583608 -0.538704 -0.882628 -0.154218  2.0
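
As the three printouts above suggest, `dropna()` and `fillna()` return new objects and leave `df1` unchanged. The sketch below checks this on a small hand-built frame (the values are made up) and uses `isna()` to locate the missing cells.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for df1: column E is missing in the last two rows.
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "E": [1.0, 1.0, np.nan, np.nan]})

mask = df1.isna()              # True where a value is missing
dropped = df1.dropna()         # drops rows containing any NaN
filled = df1.fillna(value=2)   # replaces every NaN with 2

print(mask["E"].sum())
print(dropped)
print(filled)
print(df1)  # the original still contains its NaN values
```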

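Conclusion

This tutorial has covered the basics of the pandas Python library, how to use it, and the scenarios it is suited to. For more detail, see the official pandas documentation.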