decompose() for time series: ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None

Question

I am having some trouble running an additive model correctly. I have the following data frame:

When I run this code:

import matplotlib
import statsmodels.api as sm

decomposition = sm.tsa.seasonal_decompose(df, model='additive')
fig = decomposition.plot()
matplotlib.rcParams['figure.figsize'] = [9.0, 5.0]

I get this message:

ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None

What should I do to get a plot like this example:

I took the screenshot above from this place.

Answer 1

To solve this, I ran sort_index and the code above worked:

df.sort_index(inplace= True)

Answer 2

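I had the same ValueError. This is just the result of some testing and a little research of my own, with no claim to be complete or professional. Please comment or answer if you find something wrong.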

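Of course, your data should be in the right order of the index values, which you can ensure with df.sort_index(inplace=True), as you state in your answer. This is not wrong as such, though the error message is not about the sort order, and I have checked this: the error does not go away when I sort the index of a huge dataset I have at hand. It is true that I also have to sort df.index, but decompose() can handle unsorted data as well, where items jump back and forth in time: you then simply get a lot of blue lines going from left to right and back until the whole plot is full of them. What is more, the sorting is usually already in the right order anyway. In my case, sorting does not help to fix the error. Thus I also doubt that index sorting fixed the error in your case, because: what does the error actually say?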

ValueError: You must specify:

1. [either] a period
2. or x must be a pandas object with a DatetimeIndex with a freq not set to None

First, if you have a list column so that your time series is still nested, unnest the list column before anything else. This is needed for both 1.) and 2.).

Details of 1.: "You must specify [either] a period ..."

Definition of period

"period, int, optional" from https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html:

Period of the series. Must be used if x is not a pandas object or if the index of x does not have a frequency. Overrides default periodicity of x if x is a pandas object with a timeseries index.

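The period parameter, set with an integer, means the number of cycles you expect to be in the data. If you have a df with 1000 rows and a list column in it (call it df_nested), and each list has for example 100 elements, then you will have 100 elements per cycle. It is probably smart to take period = len(df_nested) (= number of cycles) in order to get the best split of seasonality and trend. If the number of elements per cycle varies over time, other values may be better. I am not sure how to set the parameter correctly, hence the so far unanswered question on Cross Validated: statsmodels seasonal_decompose(): What is the right "period of the series" in the context of a list column (constant vs. varying number of items).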

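The "period" parameter of option 1.) has a big advantage over option 2.). Although it uses the time index (DatetimeIndex) for its x-axis, it does not require the items to hit the frequency exactly, in contrast to option 2.). Instead, it just joins together whatever comes in sequence, with the advantage that you do not need to fill any gaps: the last value of the previous event is simply joined with the first value of the following event, whether that is in the next second or on the next day.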

What is the maximum possible "period" value? If you have a list column (call the df "df_nested" again), you should first unnest the list column to a normal column. The maximum period is len(df_unnested)/2.

Example 1: 20 items in x (x is the number of all items in df_unnested) can have at most period = 10.

Example 2: having 20 items and taking period=20 throws the following error:

ValueError: x must have 2 complete cycles requires 40 observations. x only has 20 observation(s)

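Another side note: to get rid of the error in the question, period = 1 should already be enough, but for time series analysis "=1" does not reveal anything new. Every cycle is then just 1 item, the trend is the same as the original data, the seasonality is 0 and the residuals are always 0.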

####

A borrowed example:

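import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 1462352300000000000],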
                'listData': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],
                'duration_sec': [3.0, 3.0, 3.0, 3.0]})
tdi = pd.DatetimeIndex(df_test.timestamp)
df_test.set_index(tdi, inplace=True)
df_test.drop(columns='timestamp', inplace=True)
df_test.index.name = 'datetimeindex'

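# Unnest the list column: one row per list element
df_test = df_test.explode('listData')
# Spread the elements of each list evenly over duration_sec, starting at the row's timestamp
sizes = df_test.groupby(level=0)['listData'].transform('size').sub(1)
duration = df_test['duration_sec'].div(sizes)
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * duration, unit='s')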

The resulting df_test['listData'] looks as follows:

2016-05-04 08:53:20    1
2016-05-04 08:53:21    2
2016-05-04 08:53:22    1
2016-05-04 08:53:23    9
2016-05-04 08:55:00    2
2016-05-04 08:55:01    2
2016-05-04 08:55:02    3
2016-05-04 08:55:03    0
2016-05-04 08:56:40    1
2016-05-04 08:56:41    3
2016-05-04 08:56:42    3
2016-05-04 08:56:43    0
2016-05-04 08:58:20    1
2016-05-04 08:58:21    1
2016-05-04 08:58:22    3
2016-05-04 08:58:23    9

Now have a look at different integer values for period.

period = 1:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=1)
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

period = 2:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=2)
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

If you take a quarter of all items as one cycle, that is 4 (out of 16 items) here.

period = 4:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/4))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

Or if you take the maximum possible size of a cycle, that is 8 (out of 16 items) here.

period = 8:

result_add = seasonal_decompose(x=df_test['listData'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

Have a look at how the y-axes change their scale.

####

You will increase the period integer according to your needs. The maximum in the case of your question:

sm.tsa.seasonal_decompose(df, model = 'additive', period = int(len(df)/2))

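Details of 2.: "... or x must be a pandas object with a DatetimeIndex with a freq not set to None"

To get x to have a DatetimeIndex with a freq not set to None, you need to assign the freq of the DatetimeIndex using .asfreq('?'), with ? being your choice among the wide range of offset aliases from https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.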

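In your case, this option 2. is better suited, as you seem to have a list without gaps. Your monthly data should then probably be introduced as "month start frequency" --> "MS" as the offset alias: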

sm.tsa.seasonal_decompose(df.asfreq('MS'), model = 'additive')

See the linked reference for more details, including how you would deal with gaps.

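If your data is so scattered in time that there are too many gaps to fill, or if the gaps in time are not important, option 1 of using "period" is probably the better choice.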

In my df_test example, option 2. is not good. The data is totally scattered in time, and if I take a second as the frequency, you get this:

Output of df_test.asfreq('s') (= frequency in seconds):

2016-05-04 08:53:20      1
2016-05-04 08:53:21      2
2016-05-04 08:53:22      1
2016-05-04 08:53:23      9
2016-05-04 08:53:24    NaN
                      ...
2016-05-04 08:58:19    NaN
2016-05-04 08:58:20      1
2016-05-04 08:58:21      1
2016-05-04 08:58:22      3
2016-05-04 08:58:23      9
Freq: S, Name: listData, Length: 304, dtype: object

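You see here that although my data has only 16 rows, introducing a frequency in seconds forces the df to have 304 rows, just to reach from "08:53:20" to "08:58:23", which creates 288 gaps. What is more, you have to hit the exact time here. If your real frequency is 0.1 or even 0.12314 seconds instead, you will not hit most of the items with your index.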

Here is an example with min as the offset alias, df_test.asfreq('min'):

2016-05-04 08:53:20      1
2016-05-04 08:54:20    NaN
2016-05-04 08:55:20    NaN
2016-05-04 08:56:20    NaN
2016-05-04 08:57:20    NaN
2016-05-04 08:58:20      1

We see that only the first and the last minute are filled at all; the rest is not hit.

Taking the day as the offset alias, df_test.asfreq('d'):

2016-05-04 08:53:20    1

We see that you get only the first row as the resulting df, since only one day is covered. It gives you the first item found; the rest is dropped.
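The row and gap counts for the three offset aliases can be checked with a compact sketch. It rebuilds the 16 timestamps of the listing directly instead of going through explode, and uses "D" for the daily alias.

```python
import pandas as pd

# The four event start times from the listing above, each followed by 4 items one second apart.
starts = pd.to_datetime(["2016-05-04 08:53:20", "2016-05-04 08:55:00",
                         "2016-05-04 08:56:40", "2016-05-04 08:58:20"])
index = pd.DatetimeIndex([start + pd.Timedelta(seconds=k)
                          for start in starts for k in range(4)])
values = [1, 2, 1, 9, 2, 2, 3, 0, 1, 3, 3, 0, 1, 1, 3, 9]
series = pd.Series(values, index=index, name="listData")

# For each alias: number of rows after asfreq() and number of NaN gaps it introduces.
counts = {}
for alias in ("s", "min", "D"):
    resampled = series.asfreq(alias)
    counts[alias] = (len(resampled), int(resampled.isna().sum()))
print(counts)  # {'s': (304, 288), 'min': (6, 4), 'D': (1, 0)}
```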

The end of it all

Putting all of this together: in your case, take option 2., while in my df_test example, option 1 is needed.

Answer 3

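I had the same issue, and it eventually turned out (in my case at least) to be a matter of missing data points in my dataset. In my example I have hourly data for a certain period of time, and 2 separate hourly data points were missing (in the middle of the dataset). So I got the same error. When I tested a different dataset with no missing data points, it worked without any error message. Hope this helps. It is not exactly a solution.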

Answer 4

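The reason might be that you have gaps in your data. For example:

This data has gaps, which will cause an exception in the seasonal_decompose() method.

This data is good: all days are covered, and no exception will be thrown.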

Answer 5

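I had the same error, and I got it because some dates were missing. A quick fix here is to add those dates with a default value.

Be cautious when choosing the default value:

If your model is additive, then it can have 0
If your model is multiplicative, then it cannot have 0, so you can use 1

Code: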

from datetime import timedelta
import pandas as pd

#Start date and end_date
start_date = pd.to_datetime("2019-06-01")
end_date = pd.to_datetime("2021-08-20") - timedelta(days=1) #Excluding last

#List of all dates
all_date = pd.date_range(start_date, end_date, freq='d')

#Left join your main data on dates data
all_date_df = pd.DataFrame({'date':all_date})
tdf = df.groupby('date', as_index=False)['session_count'].sum()
tdf = pd.merge(all_date_df, tdf, on='date', how="left")
tdf.fillna(0, inplace=True)

Answer 6

I had the same problem. I solved it by forcing the period to be an integer.

So, I used the following for my specific case.

decompose_result = seasonal_decompose(df.Sales, model='multiplicative', period=1)
decompose_result.plot();

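where df.Sales is a pandas Series with a step of 1 between any two elements.

P.S. You can find the details of seasonal_decompose() by entering the seasonal_decompose? command. You will get details like the ones below. Check the details of each parameter.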

**Signature:**
seasonal_decompose(
    x,
    model='additive',
    filt=None,
    period=None,
    two_sided=True,
    extrapolate_trend=0,
)

Answer 7

I assume you forgot to introduce a period and pass it to the freq argument of seasonal_decompose(). That is why it threw the following ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-9b030cf1055e> in <module>()
----> 1 decomposition = sm.tsa.seasonal_decompose(df, model = 'additive')
      2 decompose_result.plot()

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
    125             freq = pfreq
    126         else:
--> 127             raise ValueError("You must specify a freq or x must be a "
    128                              "pandas object with a timeseries index with "
    129                              "a freq not set to None")

ValueError: You must specify a freq or x must be a pandas object with a time-series index with a freq not set to None

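Note: in the statsmodels release shown in this traceback, seasonal_decompose() takes freq and has no period argument (releases whose error says "You must specify a period" take period instead). If you pass period to a release that only knows freq, you will face the following TypeError: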

TypeError: seasonal_decompose() got an unexpected keyword argument 'period'

So follow the script below:

# import libraries
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
 
# Generate time-series data
total_duration = 100
step = 0.01
time = np.arange(0, total_duration, step)
 
# Period of the sinusoidal signal in seconds
T= 15
 
# Period component
series_periodic = np.sin((2*np.pi/T)*time)
 
# Add a trend component
k0 = 2
k1 = 2
k2 = 0.05
k3 = 0.001
 
series_periodic = k0*series_periodic
series_trend    = k1*np.ones(len(time))+k2*time+k3*time**2
series          = series_periodic+series_trend 

# Set the cycle length via freq (releases that accept period take period=period instead)
period = int(T/step)
results = seasonal_decompose(series, model='additive', freq=period)

trend_estimate    = results.trend
periodic_estimate = results.seasonal
residual          = results.resid
 
# Plot the time-series components
plt.figure(figsize=(14,10))
plt.subplot(221)
plt.plot(series,label='Original time series', color='blue')
plt.plot(trend_estimate ,label='Trend of time series' , color='red')
plt.legend(loc='best',fontsize=20 , bbox_to_anchor=(0.90, -0.05))
plt.subplot(222)
plt.plot(trend_estimate,label='Trend of time series',color='blue')
plt.legend(loc='best',fontsize=20, bbox_to_anchor=(0.90, -0.05))
plt.subplot(223)
plt.plot(periodic_estimate,label='Seasonality of time series',color='blue')
plt.legend(loc='best',fontsize=20, bbox_to_anchor=(0.90, -0.05))
plt.subplot(224)
plt.plot(residual,label='Decomposition residuals of time series',color='blue')
plt.legend(loc='best',fontsize=20, bbox_to_anchor=(1.09, -0.05))
plt.tight_layout()
plt.savefig('decomposition.png')

Plot of the time-series components:

If you are working with a pandas dataframe:

# import libraries
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Generate some data
np.random.seed(0)
n = 1500
dates = np.array('2020-01-01', dtype=np.datetime64) + np.arange(n)
data = 12*np.sin(2*np.pi*np.arange(n)/365) + np.random.normal(12, 2, 1500)

#=================> Approach#1 <==================
# Set period after building dataframe
df = pd.DataFrame({'data': data}, index=dates)

# Reproduce the OP's example (releases that accept period take period=15 instead of freq=15)
seasonal_decompose(df['data'], model='additive', freq=15).plot()

#=================> Approach#2 <==================
# Set the frequency with asfreq() right after building the dataframe with dates as index
df = pd.DataFrame({'data': data}, index=dates).asfreq('D').dropna()

# Reproduce the example for OP
seasonal_decompose(df, model='additive').plot()

Answer 8

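I recently used the 'Prophet' package for this. It used to be called 'FBProphet', but for some reason they removed the FB (Facebook) part.

It is a bit difficult to install on a Windows PC (you need miniconda to install it in that case).

But once it is installed, it is very user-friendly, needs only 1 line of code, and is very useful! It can also decompose seasonality, give plots with n% accuracy, and make predictions.

If you want, it can also easily account for holidays, which is pre-built into the package.

There are a lot of videos about this package on YouTube.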

https://github.com/facebook/prophet/tree/main/python