How to Resample Time Series Data in Python?

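In time series analysis, data consistency is crucial, and resampling ensures that the data is distributed at a consistent frequency. Resampling can also offer a different view of the data; in other words, depending on the resampling frequency, it can add further insight into the data.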

resample() function: used primarily for time series data.

Syntax:

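# import the python pandas library
import pandas as pd

# Signature of the resample method (a summary, not a runnable call).
# closed and label default to None, which pandas resolves per frequency.
# axis, convention and kind are deprecated in recent pandas releases.
pd.Series.resample(rule, axis=0, closed=None, label=None,
                   convention='start', kind=None, offset=None,
                   origin='start_day')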
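
To illustrate how the rule, closed and label parameters interact, here is a minimal sketch on a made-up series. The names readings, left and right and all the values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy series: one reading every 10 minutes for one hour.
index = pd.date_range("2021-01-01 00:00", periods=6, freq="10min")
readings = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# Default for a 30-minute rule: bins are closed and labelled on the left,
# i.e. [00:00, 00:30) and [00:30, 01:00).
left = readings.resample("30min").sum()

# Same rule, but bins closed and labelled on the right,
# i.e. (23:30, 00:00], (00:00, 00:30] and (00:30, 01:00].
right = readings.resample("30min", closed="right", label="right").sum()

print(left.tolist())   # [6, 15]
print(right.tolist())  # [1, 9, 11]
```

The same observations land in different bins depending on which edge is closed, so it is worth stating closed and label explicitly whenever bin boundaries matter.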

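Resampling mainly involves changing the time frequency of the original observations. The two common methods of resampling a time series are:

Upsampling
Downsampling

Upsampling

Upsampling increases the time frequency of the data. It is a data disaggregation process in which we break the time frequency down from a higher level to a lower level, for example from months to days, from days to hours, or from hours to seconds. Upsampling usually increases the size of the data, depending on the sampling frequency. If D is the size of the original data and D' is the size of the upsampled data, then D' > D.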
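
The size relation D' > D can be checked on a small made-up series. This is only a sketch: the monthly values are hypothetical, and asfreq() is used here simply to re-index the data without aggregating it.

```python
import pandas as pd

# Hypothetical monthly series (values made up for illustration),
# stamped on the first day of each month.
monthly = pd.Series(
    [100.0, 130.0, 160.0],
    index=pd.date_range("2021-01-01", periods=3, freq="MS"),
)

# Upsample to daily frequency. asfreq() only re-indexes, so the new
# days carry NaN until they are filled or interpolated.
daily = monthly.resample("D").asfreq()

print(len(monthly), len(daily))  # 3 60
print(int(daily.isna().sum()))   # 57
```

Three monthly observations become 60 daily rows (1 January to 1 March 2021), of which 57 are placeholders waiting to be filled.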

Now, let's look at an example of resampling time series data in Python.

The practice dataset used for this implementation is Detergent sales data.csv.

Example:

Python3

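# import the python pandas library
import pandas as pd

# read the data with read_csv, using the first column as a
# DatetimeIndex; squeeze("columns") turns the single remaining
# column into a Series (the squeeze=True argument of read_csv
# was removed in pandas 2.0)
data = pd.read_csv("Detergent sales data.csv", header=0,
                   index_col=0, parse_dates=True).squeeze("columns")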

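The detergent sales data shows sales values for the first 6 months. Suppose the task is to predict daily sales. Since we are given monthly data and asked to predict daily sales, upsampling is called for.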

Python3

# Use the resample function to upsample months to days.
# Each daily bin holds at most one observation, so mean()
# keeps the original values and leaves every other day as NaN.
upsampled = data.resample('D').mean()

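The output shows a sample of the dataset upsampled from months to days using the monthly mean. You can also try sum() or median(), whichever best suits the problem.

Apart from the days that were originally present in our dataset (each month's total sales figure), the remaining days have been filled with NaN values.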
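
One point worth checking when swapping aggregations is how each one treats the empty daily bins. The sketch below uses a hypothetical two-point monthly series; the variable names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly series with two observations.
monthly = pd.Series(
    [100.0, 130.0],
    index=pd.date_range("2021-01-01", periods=2, freq="MS"),
)

# mean() leaves empty daily bins as NaN.
by_mean = monthly.resample("D").mean()

# sum() returns 0 for empty bins by default (min_count=0) ...
by_sum = monthly.resample("D").sum()

# ... unless at least one observation per bin is required.
by_sum_nan = monthly.resample("D").sum(min_count=1)

print(int(by_mean.isna().sum()))     # 30
print(int((by_sum == 0).sum()))      # 30
print(int(by_sum_nan.isna().sum()))  # 30
```

Because interpolate() only fills NaN values, zeros produced by a plain sum() would be treated as real observations, so min_count=1 (or mean()) is the safer choice before interpolating.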

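Now we can fill these NaN values using a technique called interpolation. pandas provides the DataFrame.interpolate() method for this (Series.interpolate() for a Series such as ours). Interpolation fills NaN values using one of several methods, including 'nearest', 'zero', 'linear', 'quadratic', 'cubic', 'spline', 'barycentric' and 'polynomial'. We will choose 'linear' interpolation. This draws a straight line between the available data points, which in this case fall on the last day of each month, and fills in values at the chosen frequency from that line.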
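
To make the straight-line idea concrete, here is a minimal sketch on a made-up series. The dates and values are hypothetical and are not taken from the detergent dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two known values and three gaps.
index = pd.date_range("2021-01-31", "2021-02-04", freq="D")
sales = pd.Series([10.0, np.nan, np.nan, np.nan, 30.0], index=index)

# Linear interpolation spreads the change of 20 evenly over the 4 steps.
filled = sales.interpolate(method="linear")

print(filled.tolist())  # [10.0, 15.0, 20.0, 25.0, 30.0]
```

Each gap receives the value on the line joining its two known neighbours, here an increase of 5 per day.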

Python3

# use the interpolate function with method linear
# to fill the NaN values of the upsampled days
# linearly
interpolated = upsampled.interpolate(method='linear')

# Printing the linear interpolated values for month 2
print(interpolated.loc['2021-02'])

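Upsampling with polynomial interpolation

Another common interpolation method is to connect the values with a polynomial or a spline. This produces smoother curves, which can look more realistic on many datasets. Both the 'polynomial' and 'spline' methods require you to specify the order, that is, the degree of the polynomial.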
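
The difference between a straight line and a degree-2 curve can be sketched without SciPy by fitting the parabola by hand with numpy.polyfit. This is an illustration of the idea only, not pandas' own implementation of method='polynomial'; the series and its values (chosen to lie on y = t^2) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series whose known points lie on y = t**2.
index = pd.date_range("2021-01-01", periods=5, freq="D")
series = pd.Series([0.0, np.nan, 4.0, np.nan, 16.0], index=index)

# Straight lines between neighbouring known points.
linear = series.interpolate(method="linear")

# Degree-2 polynomial through the three known points, fitted by hand.
known = series.dropna()
t_known = (known.index - index[0]).days.to_numpy()
t_all = (index - index[0]).days.to_numpy()
coeffs = np.polyfit(t_known, known.to_numpy(), deg=2)
quadratic = pd.Series(np.polyval(coeffs, t_all), index=index)

print(linear.tolist())               # [0.0, 2.0, 4.0, 10.0, 16.0]
print(quadratic.round(6).tolist())
```

Both results pass through the known points, but the quadratic curve gives approximately 1 and 9 for the missing days where the straight lines give 2 and 10.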

Python3

# use the interpolate function with method polynomial.
# This fills the values of the remaining days with a
# polynomial of degree 2 (quadratic); requires SciPy.
interpolated = upsampled.interpolate(method='polynomial', order=2)

# Printing the polynomial interpolated values
print(interpolated)

Thus, we can upsample data using the resample() and interpolate() functions. Try different configurations of these functions.

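Downsampling

Downsampling decreases the time frequency of the data. It is a data aggregation process in which we aggregate the time frequency from a lower level to a higher level, for example from days to months, from hours to days, or from seconds to hours. Downsampling usually shrinks the size of the data, depending on the sampling frequency. If D is the size of the original data and D' is the size of the downsampled data, then D' < D.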
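
The size relation D' < D can be checked on a small made-up series. In this sketch the daily values are hypothetical, and the "MS" alias labels each monthly bin with the first day of the month.

```python
import pandas as pd

# Hypothetical daily series: the values 1..90 for 1 Jan to 31 Mar 2021.
daily = pd.Series(
    range(1, 91),
    index=pd.date_range("2021-01-01", periods=90, freq="D"),
    dtype="float64",
)

# Downsample to monthly frequency by averaging each month's values.
monthly = daily.resample("MS").mean()

print(len(daily), len(monthly))  # 90 3
print(monthly.tolist())          # [16.0, 45.5, 75.0]
```

Ninety daily rows collapse into three monthly rows, each holding the mean of the days that fall in that month.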

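For example, the car sales data shows daily sales values for the first 6 months. Suppose the task is to predict quarterly sales. Since we are given daily data and asked to predict quarterly sales, downsampling is called for.

The practice dataset used in this implementation is car-sales.csv.

Example: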

Python3

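# import the python pandas library
import pandas as pd

# read the data using the pandas read_csv() function;
# squeeze("columns") turns the single data column into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
data = pd.read_csv("car-sales.csv", header=0,
                   index_col=0, parse_dates=True).squeeze("columns")

# printing the first 6 rows of the dataset
print(data.head(6))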

We can use the quarterly resampling frequency 'Q' to aggregate the data by quarter.

Python3

# Use the resample function to downsample days
# to quarters using the mean sales of each quarter.
# On pandas 2.2 and later the alias 'Q' is deprecated in favour of 'QE'.
downsampled = data.resample('Q').mean()

# printing the downsampled data.
print(downsampled)

This downsampled data can now be used to forecast quarterly sales.
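
As a variation on the quarterly aggregation, several statistics can be computed in one pass with agg(). The sketch below uses a made-up daily series rather than car-sales.csv, and the "QS" alias, which labels each quarter by its first day and is accepted by both older and newer pandas releases.

```python
import pandas as pd

# Hypothetical daily sales: a constant 2 units per day for
# the first six months of 2021.
index = pd.date_range("2021-01-01", "2021-06-30", freq="D")
daily_sales = pd.Series(2.0, index=index)

# Downsample to quarters, computing the mean and the total per quarter.
quarterly = daily_sales.resample("QS").agg(["mean", "sum"])

print(quarterly)
```

The first quarter covers 90 days and the second 91, so the sums are 180 and 182 while the mean stays at 2. Whether the mean or the sum is the right quarterly figure depends on what is being forecast.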