How to check whether time series data is stationary with Python?
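Time series data is generally characterized by its temporal nature. This temporal nature adds a trend or seasonality to the data, which is the structure that time series analysis and forecasting work with. A time series is said to be stationary if its statistical properties do not change over time, that is, if it has no temporal structure. Checking whether the data is stationary is therefore an essential first step: a stationary series has no trend or seasonal pattern left to extract, while many classical forecasting models assume stationarity and need non-stationary data to be transformed first.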
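Example plot of stationary data:
Types of stationarity:
Identifying whether data is stationary means identifying which finer-grained notion of stationarity applies. The types of stationarity observed in time series data include:
- Trend stationary: a time series that shows no trend, or that is stationary once a deterministic trend is removed.
- Seasonal stationary: a time series that shows no seasonal variation.
- Strictly stationary: the joint distribution of the observations is invariant to shifts in time.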
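To make the distinction concrete, here is a minimal sketch that contrasts a stationary series with a non-stationary one. The two synthetic series, the seed, the series length and the helper name `half_stats` are illustrative assumptions and do not come from the datasets used in this article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, an arbitrary choice

n = 3000
# White noise: constant mean and variance, so it is stationary
white_noise = rng.normal(loc=0.0, scale=1.0, size=n)
# Random walk: cumulative sum of noise, a classic non-stationary series
random_walk = np.cumsum(rng.normal(size=n))


def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    return (first.mean(), first.var()), (second.mean(), second.var())


for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

For the white noise the two halves should report nearly the same mean and variance, whereas the random walk typically drifts, so its halves tend to disagree.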
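Step-by-step implementation
The following steps walk through the methods for checking whether a given time series is stationary.
Step 1: Plot the time series data
Practice dataset for this step: daily-female-births-IN.csv.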
Python3
# import the pandas library
import pandas as pd
# import matplotlib for plotting
import matplotlib.pyplot as plt
# read the dataset using the pandas read_csv() function
data = pd.read_csv("daily-total-female-births-IN.csv",
                   header=0, index_col=0)
# use a simple line plot to see how the values
# change over time
plt.plot(data)
# display the figure (needed when running as a script)
plt.show()
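Output:
Step 2: Evaluate descriptive statistics
This is usually done by splitting the data into two or more partitions and computing the mean and variance of each group. If these summary statistics (the first and second moments) are consistent across the partitions, we can assume the data is stationary. Let us use the airline passenger count dataset covering 1949 to 1960.
Practice dataset for this step: AirPassengers.csv.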
Python3
# import the pandas library
import pandas as pd
# import matplotlib for plotting
import matplotlib.pyplot as plt
# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv",
                   header=0, index_col=0)
# print the first 10 rows of data
print(data.head(10))
# use a simple line plot to see how the
# passenger counts change over time
plt.plot(data)
# display the figure (needed when running as a script)
plt.show()
Output:
Now let us split this data into groups, compute the mean and variance of each group, and check whether they are consistent.
Python3
# import the pandas library
import pandas as pd
# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)
# extract only the passenger counts as a NumPy array
values = data.values
# size of each partition when splitting the data into 3
parts = len(values) // 3
# split the data into three parts
part_1 = values[0:parts]
part_2 = values[parts:parts * 2]
part_3 = values[parts * 2:parts * 3]
# calculate the mean of each part
mean_1, mean_2, mean_3 = part_1.mean(), part_2.mean(), part_3.mean()
# calculate the variance of each part
var_1, var_2, var_3 = part_1.var(), part_2.var(), part_3.var()
# print the mean of the three groups
print('mean1=%f, mean2=%f, mean3=%f' % (mean_1, mean_2, mean_3))
# print the variance of the three groups
print('variance1=%f, variance2=%f, variance3=%f' % (var_1, var_2, var_3))
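Output:
The output clearly shows that the means and variances of the three groups differ greatly from one another, which indicates that the data is non-stationary. By contrast, if the values were, say, mean_1 = 150, mean_2 = 160, mean_3 = 155 and variance_1 = 33, variance_2 = 35, variance_3 = 37, we could conclude that the data is stationary. This method can fail for some distributions, such as the log-normal distribution.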
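The partition check can also be wrapped in a small reusable helper. The sketch below is one possible way to do it; the function name `partition_stats`, its `n_parts` parameter and the synthetic straight-line series are assumptions made for illustration and are not part of the original walkthrough.

```python
import numpy as np


def partition_stats(values, n_parts=3):
    """Split a series into n_parts equal chunks and return per-chunk means and variances.

    Any leftover observations at the end (when the length is not a
    multiple of n_parts) are ignored, as in the article's own split.
    """
    values = np.asarray(values, dtype=float).ravel()
    size = len(values) // n_parts
    chunks = [values[i * size:(i + 1) * size] for i in range(n_parts)]
    means = [float(chunk.mean()) for chunk in chunks]
    variances = [float(chunk.var()) for chunk in chunks]
    return means, variances


# Synthetic example: a straight upward line 0, 1, ..., 89
trend = np.arange(90, dtype=float)
means, variances = partition_stats(trend)
print(means)      # [14.5, 44.5, 74.5]
print(variances)  # identical for all three chunks
```

In this toy series the variances agree across chunks while the means climb steadily, which is a reminder that both statistics need to be compared before calling a series stationary.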
Let us repeat the AirPassengers example, this time taking the log of the passenger counts with NumPy's log() function, and check the results.
Python3
# import the pandas library
import pandas as pd
# import matplotlib for plotting
import matplotlib.pyplot as plt
# import the numpy library
import numpy as np
# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)
# extract the passenger counts and apply a
# log transform to them
values = np.log(data.values)
# print the first 15 log-transformed values
print(values[0:15])
# use a simple line plot to see how the
# log-transformed values change over time
plt.plot(values)
# display the figure (needed when running as a script)
plt.show()
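Output:
The output shows that a trend is still present, although it is not as steep as before. Now let us compute the partition means and variances.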
Python3
# `values` is the log-transformed array from the previous block
# size of each partition when splitting the data into 3
parts = len(values) // 3
# split the data into three parts
part_1 = values[0:parts]
part_2 = values[parts:parts * 2]
part_3 = values[parts * 2:parts * 3]
# calculate the mean of each part
mean_1, mean_2, mean_3 = part_1.mean(), part_2.mean(), part_3.mean()
# calculate the variance of each part
var_1, var_2, var_3 = part_1.var(), part_2.var(), part_3.var()
# print the mean of the three groups
print('mean1=%f, mean2=%f, mean3=%f' % (mean_1, mean_2, mean_3))
# print the variance of the three groups
print('variance1=%f, variance2=%f, variance3=%f' % (var_1, var_2, var_3))
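Output:
Since the series is non-stationary, we would expect the means and variances to differ clearly, yet after the log transform they look close to one another. In cases like this the method can fail badly. To avoid that, we can use a statistical test, discussed below.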
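The effect of the log transform on the partition statistics can be reproduced on synthetic data. The following is a rough sketch only: the exponential-growth series, its growth rate, noise level and seed are hypothetical choices, not the AirPassengers data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary fixed seed

t = np.arange(300)
# Hypothetical series: exponential growth with multiplicative noise
series = np.exp(0.01 * t + rng.normal(scale=0.1, size=t.size))
log_series = np.log(series)


def thirds(values):
    """Split a 1-D array into three equal consecutive parts."""
    size = len(values) // 3
    return [values[i * size:(i + 1) * size] for i in range(3)]


raw_var = [float(p.var()) for p in thirds(series)]
log_var = [float(p.var()) for p in thirds(log_series)]
log_mean = [float(p.mean()) for p in thirds(log_series)]

print("raw variances:", raw_var)
print("log variances:", log_var)
print("log means:    ", log_mean)
```

On the raw scale the variance grows sharply from one part to the next. After the log transform the variances become similar and the means sit on a much smaller scale, even though the series still trends upward, so a quick glance at the numbers can suggest stationarity where there is none.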
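Step 3: Augmented Dickey-Fuller test
This is a statistical test designed specifically to check whether univariate time series data is stationary. The test is built on a hypothesis and tells us how much probabilistic evidence there is for or against it. It is classed as a unit root test, which determines how strongly a univariate time series is defined by a trend. Let us define the null and alternative hypotheses:
- H0 (null hypothesis): the time series data is non-stationary
- H1 (alternative hypothesis): the time series data is stationary
Assume alpha = 0.05, which corresponds to 95% confidence. The test result is interpreted through the p-value: if p > 0.05 we fail to reject the null hypothesis, and if p <= 0.05 we reject it. Now let us take the same air passenger dataset and test it with the adfuller() function provided by the statsmodels package to check whether the data is stationary.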
Python3
# import the pandas package
import pandas as pd
# import the adfuller function from the statsmodels
# package to perform the ADF test
from statsmodels.tsa.stattools import adfuller
# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)
# extract only the passenger counts as a NumPy array
values = data.values
# pass the passenger counts to adfuller();
# the result tuple is stored in res
res = adfuller(values)
# print the test statistic and p-value of the ADF test
print('Augmented Dickey-Fuller statistic: %f' % res[0])
print('p-value: %f' % res[1])
# print the critical values at the different alpha levels
print('critical values at different levels:')
for k, v in res[4].items():
    print('\t%s: %.3f' % (k, v))
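Output:
Under our hypotheses, the ADF statistic is far greater (less negative) than the critical values at every level, and the p-value is also greater than 0.05. This means we fail to reject the null hypothesis at the 90%, 95% and 99% confidence levels, so the time series data is treated as non-stationary.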
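The decision rule used here can be written down as a short helper. This is only a sketch: the function name `interpret_adf` is hypothetical, and the numbers passed to it are made-up values for illustration, not the AirPassengers output. Its arguments mirror the statistic, p-value and critical-value dictionary found in `res[0]`, `res[1]` and `res[4]`.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result as (reject_null, below_critical, verdict)."""
    # Reject the null hypothesis of non-stationarity when p <= alpha
    reject = p_value <= alpha
    # The statistic must be more negative than a critical value
    # for the null hypothesis to be rejected at that level
    below = {level: statistic < cv for level, cv in critical_values.items()}
    verdict = "stationary" if reject else "non-stationary (cannot reject H0)"
    return reject, below, verdict


# Made-up numbers purely for illustration
reject, below, verdict = interpret_adf(
    statistic=0.5,
    p_value=0.98,
    critical_values={"1%": -3.4, "5%": -2.9, "10%": -2.6},
)
print(verdict)  # non-stationary (cannot reject H0)
print(below)
```

Comparing the statistic against each critical value and comparing the p-value against alpha are two views of the same decision, which is why the article reports both.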
Now let us run the ADF test on the log-transformed values and cross-check our results.
Python3
# import the pandas package
import pandas as pd
# import the adfuller function from the statsmodels
# package to perform the ADF test
from statsmodels.tsa.stattools import adfuller
# import the numpy package
import numpy as np
# read the dataset using the pandas read_csv() function
data = pd.read_csv("AirPassengers.csv", header=0, index_col=0)
# extract the passenger counts and apply a
# log transform to them
values = np.log(data.values)
# pass the log-transformed counts to adfuller();
# the result tuple is stored in res
res = adfuller(values)
# print the test statistic and p-value of the ADF test
print('Augmented Dickey-Fuller statistic: %f' % res[0])
print('p-value: %f' % res[1])
# print the critical values at the different alpha levels
print('critical values at different levels:')
for k, v in res[4].items():
    print('\t%s: %.3f' % (k, v))
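Output:
As you can see, the ADF test again reports a statistic far greater than the critical values at every level and a p-value well above 0.05. We therefore fail to reject the null hypothesis at the 90%, 95% and 99% confidence levels, which means the log-transformed series is still treated as non-stationary.
Hence the ADF unit root test is a more robust way to check whether time series data is stationary than comparing partition statistics.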
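For intuition about what the test statistic measures, the sketch below computes the basic (non-augmented) Dickey-Fuller statistic with plain NumPy. It is a simplified illustration under stated assumptions, not what adfuller() does internally: adfuller() also adds lagged difference terms and computes a p-value. The function name `dickey_fuller_stat`, the synthetic series and the seed are assumptions made for this example.

```python
import numpy as np


def dickey_fuller_stat(y):
    """t-statistic of gamma in  diff(y)[t] = alpha + gamma * y[t-1] + error.

    A strongly negative value is evidence against a unit root,
    that is, evidence that the series is stationary.
    """
    y = np.asarray(y, dtype=float).ravel()
    dy = np.diff(y)                                   # y[t] - y[t-1]
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)     # ordinary least squares fit
    resid = dy - X @ beta
    dof = len(dy) - X.shape[1]                        # residual degrees of freedom
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)             # covariance of the OLS estimates
    return float(beta[1] / np.sqrt(cov[1, 1]))


rng = np.random.default_rng(seed=42)  # arbitrary fixed seed
white_noise = rng.normal(size=500)             # stationary by construction
random_walk = np.cumsum(rng.normal(size=500))  # has a unit root by construction

wn_stat = dickey_fuller_stat(white_noise)
rw_stat = dickey_fuller_stat(random_walk)
print("white noise:", wn_stat)
print("random walk:", rw_stat)
```

With a constant in the regression, the large-sample 5% critical value is roughly -2.86. The white noise statistic should fall far below it, while a random walk usually stays above it, which matches the fail-to-reject outcome seen for the passenger data.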