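Working with time series data in Python
A time series is a collection of observations (activity) of a single subject (entity) recorded at different time intervals. For metrics the time series is evenly spaced, while for events it is unevenly spaced. With the pandas module we can attach a date and time to each record, fetch DataFrame records, and find the data that falls within a particular date and time range.
Generating a date range:
The pandas package is imported. The pd.date_range() method is used to create a date range with a month-end frequency.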
Python3
# importing pandas
import pandas as pd
# creating a date range with unambiguous ISO 8601 dates
# ('M' means month-end; pandas 2.2 and later spell it 'ME')
Date_range = pd.date_range(start='2020-01-12', end='2021-05-20', freq='M')
print(Date_range)
print(type(Date_range))
print(type(Date_range[0]))
Output:
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
'2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
'2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'],
dtype='datetime64[ns]', freq='M')
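The same method accepts other frequencies and a periods argument in place of an end date. The following is a minimal sketch rather than part of the original walkthrough; the variable names daily and mondays are chosen here for illustration only.

```python
import pandas as pd

# Seven consecutive days starting on a fixed date (daily frequency).
daily = pd.date_range(start="2020-01-01", periods=7, freq="D")

# Every Monday between two dates (weekly frequency anchored on Monday).
mondays = pd.date_range(start="2020-01-01", end="2020-01-31", freq="W-MON")

print(daily)
print(mondays)
```

Passing periods is convenient when the number of observations is known but the end date is not.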
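Operations on timestamp data:
The date range is converted to a DataFrame with the pd.DataFrame() method. The column is converted to DateTime using the to_datetime() method. The info() method reports information about the DataFrame, such as whether it contains null values and the data type of each column.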
Python3
# importing pandas
import pandas as pd
# creating a date range with unambiguous ISO 8601 dates
# ('M' means month-end; pandas 2.2 and later spell it 'ME')
Date_range = pd.date_range(start='2020-01-12', end='2021-05-20', freq='M')
# creating a Dataframe
Data = pd.DataFrame(Date_range, columns=['Date'])
# converting the column to datetime
Data['Date'] = pd.to_datetime(Data['Date'])
print(Data.info())
Output:
RangeIndex: 16 entries, 0 to 15
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 16 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 256.0 bytes
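Converting data from strings to timestamps:
If we have a list of strings that look like dates and times, we can first convert it to a DataFrame with the pd.DataFrame() method and then convert the column to DateTime with the pd.to_datetime() method.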
Python3
# importing pandas
import pandas as pd
# creating string data
string_data = ['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
'2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
'2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30']
Data = pd.DataFrame(string_data, columns=['Date'])
Data['Date'] = pd.to_datetime(Data['Date'])
print(Data.info())
Output:
RangeIndex: 16 entries, 0 to 15
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 16 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 256.0 bytes
None
Depending on the format of our string values, we can convert them to datetime objects. The datetime.strptime() function can be used in this case.
Python3
# importing pandas
import pandas as pd
from datetime import datetime
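# string data
string_data = ['May-20-2021', 'May-21-2021', 'May-22-2021']
# '%B' matches a full month name ('May' is both full and abbreviated);
# abbreviated names such as 'Jun' would need '%b' instead
timestamp_data = [datetime.strptime(x, '%B-%d-%Y') for x in string_data]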
print(timestamp_data)
Data = pd.DataFrame(timestamp_data, columns=['Date'])
print(Data.info())
Output:
[datetime.datetime(2021, 5, 20, 0, 0), datetime.datetime(2021, 5, 21, 0, 0), datetime.datetime(2021, 5, 22, 0, 0)]
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 152.0 bytes
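As an alternative to looping with datetime.strptime(), pd.to_datetime() accepts a format argument and parses the whole list at once. The sketch below assumes an English locale for month names; the variables parsed and coerced and the deliberately invalid string are illustrative additions, not part of the original example.

```python
import pandas as pd

string_data = ["May-20-2021", "May-21-2021", "May-22-2021"]

# Parse every string in one call; '%b' is the abbreviated month name.
parsed = pd.to_datetime(string_data, format="%b-%d-%Y")
print(parsed)

# errors='coerce' turns unparseable entries into NaT instead of raising.
coerced = pd.to_datetime(["May-20-2021", "not a date"],
                         format="%b-%d-%Y", errors="coerce")
print(coerced)
```

Coercing to NaT is useful when a column mixes valid dates with placeholder text, because the bad rows can then be handled like any other missing value.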
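Slicing and indexing time series data:
In this example a CSV file is read, and the column holding string data is converted to DateTime with the pd.to_datetime() method. That column is then set as the index, which lets us slice and index the data by date. data.loc['2020-01-22'][:10] selects the rows for 2020-01-22 and slices the result further to return that day's first 10 observations.
The CSV file used in these examples is named covid_data.csv.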
Python3
# importing pandas
import pandas as pd
# reading csv file
data = pd.read_csv('covid_data.csv')
# converting string data to datetime
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])
# setting index
data = data.set_index('ObservationDate')
print(data.head())
# indexing and slicing through the dataframe
print(data.loc['2020-01-22'][:10])
Output:
Unnamed: 0 Province/State ... Deaths Recovered
ObservationDate ...
2020-01-22 0 Anhui ... 0.0 0.0
2020-01-22 1 Beijing ... 0.0 0.0
2020-01-22 2 Chongqing ... 0.0 0.0
2020-01-22 3 Fujian ... 0.0 0.0
2020-01-22 4 Gansu ... 0.0 0.0
[5 rows x 7 columns]
Unnamed: 0 Province/State ... Deaths Recovered
ObservationDate ...
2020-01-22 0 Anhui ... 0.0 0.0
2020-01-22 1 Beijing ... 0.0 0.0
2020-01-22 2 Chongqing ... 0.0 0.0
2020-01-22 3 Fujian ... 0.0 0.0
2020-01-22 4 Gansu ... 0.0 0.0
2020-01-22 5 Guangdong ... 0.0 0.0
2020-01-22 6 Guangxi ... 0.0 0.0
2020-01-22 7 Guizhou ... 0.0 0.0
2020-01-22 8 Hainan ... 0.0 0.0
2020-01-22 9 Hebei ... 0.0 0.0
[10 rows x 7 columns]
In this example, we slice the data from '2020-01-22' to '2020-02-22'. Both endpoints are included in the result.
Python3
# importing pandas
import pandas as pd
from datetime import datetime
# reading csv file
data = pd.read_csv('covid_data.csv')
# converting string data to datetime
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])
# setting index
data = data.set_index('ObservationDate')
# indexing and slicing through the dataframe
print(data.loc['2020-01-22':'2020-02-22'])
Output:
Unnamed: 0 Province/State ... Deaths Recovered
ObservationDate ...
2020-01-22 0 Anhui ... 0.0 0.0
2020-01-22 1 Beijing ... 0.0 0.0
2020-01-22 2 Chongqing ... 0.0 0.0
2020-01-22 3 Fujian ... 0.0 0.0
2020-01-22 4 Gansu ... 0.0 0.0
... ... ... ... ... ...
2020-02-22 2169 San Antonio, TX ... 0.0 0.0
2020-02-22 2170 Seattle, WA ... 0.0 1.0
2020-02-22 2171 Tempe, AZ ... 0.0 0.0
2020-02-22 2172 Unknown ... 0.0 0.0
2020-02-22 2173 NaN ... 0.0 0.0
[2174 rows x 7 columns]
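Date-based selection can also be tried without the CSV file. The following sketch builds a small synthetic DataFrame (the name demo and its value column are made up for illustration) and shows partial-string indexing for a whole month alongside an inclusive date slice.

```python
import pandas as pd

# Synthetic daily data for the first quarter of 2020.
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
demo = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Partial-string indexing: every row in February 2020.
feb = demo.loc["2020-02"]

# Label-based slice: both endpoints are included.
window = demo.loc["2020-01-30":"2020-02-02"]

print(len(feb))
print(window)
```

Unlike positional slicing, a label slice on a DatetimeIndex keeps the end label, which is why the window above contains four rows.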
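Resampling time series data for aggregates and summary statistics over different time periods:
To resample time series data, use the pandas resample() function. It is a convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index; otherwise the caller must pass the label of a datetime-like series or index to the on or level keyword argument.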
Python3
# importing pandas
import pandas as pd
from datetime import datetime
# reading csv file
data = pd.read_csv('covid_data.csv')
# converting string data to datetime
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])
# setting index
data = data.set_index('ObservationDate')
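# resampling data by year and averaging the numeric columns
# ('Y' means year-end; pandas 2.2 and later spell it 'YE')
data = data.resample('Y').mean(numeric_only=True)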
print(data)
Output:
Unnamed: 0 Confirmed Deaths Recovered
ObservationDate
2020-12-31 96232.5 39696.116550 1160.959453 24659.893368
2021-12-31 249447.0 163315.277678 3514.893386 93925.632661
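To make the two ways of supplying the time axis concrete, here is a small sketch on synthetic data. The names days, indexed, frame, weekly, and weekly_on are assumptions introduced for this illustration: the first call resamples on a DatetimeIndex, the second passes a column label through the on keyword.

```python
import pandas as pd

# Fourteen days starting on Monday 2020-01-06, one observation per day.
days = pd.date_range("2020-01-06", periods=14, freq="D")

# Case 1: the DataFrame has a datetime-like index.
indexed = pd.DataFrame({"value": [1] * len(days)}, index=days)
weekly = indexed.resample("W").sum()

# Case 2: the dates live in an ordinary column, so we name it with on=.
frame = pd.DataFrame({"date": days, "value": [1] * len(days)})
weekly_on = frame.resample("W", on="date")["value"].sum()

print(weekly)
print(weekly_on)
```

Both calls group the rows into weeks ending on Sunday, so each of the two weekly bins sums seven daily observations.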
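Computing rolling statistics, such as a rolling mean:
The pandas DataFrame.rolling() method lets you compute statistics over a rolling window. Rolling-window calculations are most common in signal processing and time series analysis. In other words, we take a window of size k at a time and apply some mathematical operation to it. A window of size k covers k consecutive values at a time. In the simplest case all k values are weighted equally. In the example below the window size is 5 and the operation is a sum.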
Python3
# importing pandas
import pandas as pd
from datetime import datetime
# reading csv file
data = pd.read_csv('covid_data.csv')
# converting string data to datetime
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])
data['Last Update'] = pd.to_datetime(data['Last Update'])
# setting index
data = data.set_index('ObservationDate')
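# copy the selection so that adding a column does not touch the original frame
data = data[['Last Update', 'Confirmed']].copy()
# rolling sum over the numeric 'Confirmed' column only
data['rolling_sum'] = data['Confirmed'].rolling(5).sum()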
print(data.head())
Output:
Last Update Confirmed rolling_sum
ObservationDate
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 14.0 NaN
2020-01-22 2020-01-22 17:00:00 6.0 NaN
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 0.0 22.0
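The section heading mentions a rolling mean, while the example above computes a rolling sum. A rolling mean works the same way; the sketch below reuses the first five Confirmed values from the output as a plain Series (the names confirmed, rolling_mean, and partial are illustrative) and also shows the min_periods argument, which allows results before the window is full.

```python
import pandas as pd

confirmed = pd.Series([1.0, 14.0, 6.0, 1.0, 0.0])

# Mean over a window of 3; the first two positions have too few values.
rolling_mean = confirmed.rolling(3).mean()

# min_periods=1 emits a mean as soon as one value is available.
partial = confirmed.rolling(3, min_periods=1).mean()

print(rolling_mean)
print(partial)
```

With the default settings the first k - 1 results are NaN, which is exactly the gap that the missing-data techniques in the next section address.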
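Handling missing data:
In the previous example the rolling_sum column contains NaN values, so we can use that data to demonstrate how to handle missing data.
When a CSV file contains null values, they appear as NaN in the DataFrame. fillna() lets the user replace NaN values with values of their own, much as the pandas dropna() function removes null values from a DataFrame. Missing values are filled backward by passing backfill as the method argument to fillna(), or with the equivalent bfill() method. Forward filling works the same way with ffill.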
Python3
# importing pandas
import pandas as pd
from datetime import datetime
# reading csv file
data = pd.read_csv('covid_data.csv')
# converting string data to datetime
data['ObservationDate'] = pd.to_datetime(data['ObservationDate'])
data['Last Update'] = pd.to_datetime(data['Last Update'])
# setting index
data = data.set_index('ObservationDate')
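data = data[['Last Update', 'Confirmed']].copy()
# rolling sum over the numeric 'Confirmed' column only
data['rolling_sum'] = data['Confirmed'].rolling(5).sum()
print(data.head())
# dealing with missing data: backward-fill the NaN values
# (bfill() is equivalent to the older fillna(method='backfill'))
data['rolling_backfilled'] = data['rolling_sum'].bfill()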
print(data.head(5))
Output:
Last Update Confirmed rolling_sum
ObservationDate
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 14.0 NaN
2020-01-22 2020-01-22 17:00:00 6.0 NaN
2020-01-22 2020-01-22 17:00:00 1.0 NaN
2020-01-22 2020-01-22 17:00:00 0.0 22.0
Last Update Confirmed rolling_sum rolling_backfilled
ObservationDate
2020-01-22 2020-01-22 17:00:00 1.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 14.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 6.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 1.0 NaN 22.0
2020-01-22 2020-01-22 17:00:00 0.0 22.0 22.0
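The difference between the fill directions is easiest to see on a tiny Series. This is a minimal sketch with made-up values; s, forward, backward, and zeros are names chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Forward fill copies the last valid value forward; a leading NaN stays.
forward = s.ffill()

# Backward fill copies the next valid value backward; a trailing NaN stays.
backward = s.bfill()

# Replace every NaN with a constant of our own choosing.
zeros = s.fillna(0)

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "zero": zeros}))
```

Note that neither direction can fill a gap that has no valid neighbour on the side it copies from, so a rolling result that starts with NaN values needs a backward fill, as in the example above.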
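Basics of Unix/epoch time:
When working with time series data you may come across time values expressed in Unix time. Unix time, sometimes called epoch time, is the number of seconds elapsed since 00:00:00 Coordinated Universal Time (UTC) on Thursday, 1 January 1970. Because it is defined relative to UTC, Unix time lets us interpret timestamps without being confused by time zones, daylight saving time, and similar factors.
In the example below we convert an epoch time to a timestamp with the pd.to_datetime() method. To convert the UTC time to a particular time zone, the tz_localize() and tz_convert() methods are used. Here we convert it to the 'Europe/Berlin' time zone.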
Python3
# importing pandas
import pandas as pd
from datetime import datetime
# epoch time
epoch = 1598776989
# converting to timestamp
timestamp = pd.to_datetime(epoch, unit='s')
print(timestamp)
# converting it to a particular time zone
print(timestamp.tz_localize('UTC').tz_convert('Europe/Berlin'))
Output:
2020-08-30 08:43:09
2020-08-30 10:43:09+02:00
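The conversion can be checked by going back from the localized timestamp to epoch seconds. The sketch below is an illustrative addition: it passes utc=True to pd.to_datetime() so that no separate tz_localize() step is needed, and the names ts_utc, ts_berlin, and back are chosen here.

```python
import pandas as pd

epoch = 1598776989

# utc=True yields a timezone-aware timestamp directly.
ts_utc = pd.to_datetime(epoch, unit="s", utc=True)

# Changing the zone changes the wall-clock reading, not the instant.
ts_berlin = ts_utc.tz_convert("Europe/Berlin")

# timestamp() returns POSIX seconds, independent of the attached zone.
back = int(ts_berlin.timestamp())

print(ts_utc)
print(ts_berlin)
print(back == epoch)
```

The round trip returns the original value because a time zone conversion only changes how the instant is displayed.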