How to detect missing values in a time series for long periods

Question

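I have electricity consumption data for 2016-2019. The data is recorded every 30 minutes over the 4 years. There is no data between 13/03/2019 and 31/03/2019.

How can I detect a gap like this in code, without visualizing the data? I have 12 countries, and they may have similar gaps in other months that are not easy to spot. (I want to detect whether data is missing for more than 3 consecutive days.) Thanks for your help!

Here is the data: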

              Country Code  Electric Consumption (MW)
Date (index)
2016-01-01              84                   354642.0
2016-01-02              84                   376207.0
2016-01-03              84                   381534.0
2016-01-04              84                   435561.0
2016-01-05              84                   447820.0
...                    ...                        ...
2019-12-27              12                   374340.0
2019-12-28              12                   372761.0
2019-12-29              12                   379411.0
2019-12-30              12                   416044.0
2019-12-31              12                    87519.0

Answer 1

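You are the best judge here: the right choice depends on what gives the best result for what you are trying to do. By the way, you are trying to fill the missing values, not identify them.

DataFrame.interpolate() offers a number of options, which are documented here: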

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html?highlight=interpolate#pandas.DataFrame.interpolate
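As a minimal sketch of how those options behave, the example below interpolates a made-up half-hourly series. The timestamps and values are hypothetical, not taken from the question's data; `method="time"` and `limit` are existing parameters of `interpolate()`.

```python
import numpy as np
import pandas as pd

# Hypothetical half-hourly readings with three consecutive missing values.
idx = pd.date_range("2019-03-13 00:00", periods=6, freq="30min")
s = pd.Series([100.0, np.nan, np.nan, np.nan, 140.0, 150.0], index=idx)

# method="time" interpolates according to the spacing of the timestamps.
# limit=2 fills at most two consecutive NaNs, so longer gaps remain visible.
filled = s.interpolate(method="time", limit=2)
print(filled)
```

With `limit` set, a long outage is only partly filled, which keeps it from being silently replaced by a straight line.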

There are also options that do not compute anything from the data and simply copy values from adjacent rows, such as DataFrame.ffill() and DataFrame.bfill().
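A short sketch of the two copy-based fills on a made-up daily frame (the dates and the placement of the NaNs are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

col = "Electric Consumption (MW)"
idx = pd.date_range("2019-03-13", periods=5, freq="D")
df = pd.DataFrame({col: [354642.0, np.nan, np.nan, 435561.0, np.nan]}, index=idx)

forward = df.ffill()        # each NaN takes the last valid value before it
backward = df.bfill()       # each NaN takes the next valid value after it
capped = df.ffill(limit=1)  # copy forward at most one row per gap

print(pd.concat({"ffill": forward[col], "bfill": backward[col],
                 "ffill(limit=1)": capped[col]}, axis=1))
```

Note that `bfill()` leaves the trailing NaN in place because there is no later value to copy from.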

Answer 2

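Here is one way to identify gaps of more than 3 days and fill them. Note that it works on each unique country code separately. You can collect each final_df in a list and use pd.concat() when you need to put them back together.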

import pandas as pd
import numpy as np

df = pd.DataFrame({'Country Code': {'2016-01-01': 84,
  '2016-01-02': 84,
  '2016-01-03': 84,
  '2016-01-04': 84,
  '2016-01-05': 84,
  '2019-12-27': 12,
  '2019-12-28': 12,
  '2019-12-29': 12,
  '2019-12-30': 12,
  '2019-12-31': 12},
 'Electric Consumption (MW)': {'2016-01-01': 354642.0,
  '2016-01-02': 376207.0,
  '2016-01-03': 381534.0,
  '2016-01-04': 435561.0,
  '2016-01-05': 447820.0,
  '2019-12-27': 374340.0,
  '2019-12-28': 372761.0,
  '2019-12-29': 379411.0,
  '2019-12-30': 416044.0,
  '2019-12-31': 87519.0}})


# change index value for fake gap
df.index = ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2019-12-27', '2019-12-28',
               '2019-12-29', '2019-12-30', '2020-01-06']

#convert object dates to datetime
df.index = pd.to_datetime(df.index)


for g in df['Country Code'].unique():
    
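    # look at each unique country; .copy() avoids a SettingWithCopyWarning
    # when the new column is added below
    country_slice = df.loc[df['Country Code'] == g].copy()
    # use timedelta to flag rows that follow a gap of more than 3 days
    country_slice['Three Day Gap'] = country_slice.index.to_series().diff() > pd.Timedelta('3d')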
    # create a new index with previous min and max
    idx = pd.date_range(country_slice.index.min(), country_slice.index.max())

    s = country_slice['Electric Consumption (MW)']

    s.index = pd.DatetimeIndex(s.index)
    # this gives us a series with new rows and nans for the missing dates
    s = s.reindex(idx, fill_value=np.nan)
    # join the old data back onto the new index
    country_slice_join = country_slice.join(s, how='outer', lsuffix='L')
    # now we can interpolate as missing dates are new rows
    country_slice_join['interpolate'] = country_slice_join['Electric Consumption (MW)'].interpolate(method='linear', axis=0)
    
    country_slice_join['Country Code'] = country_slice_join['Country Code'].ffill()
    # remove temp columns
    final_df = country_slice_join[['Country Code', 'interpolate']]
    
    final_df.columns = ['Country Code', 'Electric Consumption (MW)']

Example country_slice_join output, before final_df is created:

            Country Code  Electric Consumption (MW)L  Three Day Gap  Electric Consumption (MW)    interpolate
2019-12-27          12.0                    374340.0          False                   374340.0  374340.000000
2019-12-28          12.0                    372761.0          False                   372761.0  372761.000000
2019-12-29          12.0                    379411.0          False                   379411.0  379411.000000
2019-12-30          12.0                    416044.0          False                   416044.0  416044.000000
2019-12-31          12.0                         NaN            NaN                        NaN  369111.857143
2020-01-01          12.0                         NaN            NaN                        NaN  322179.714286
2020-01-02          12.0                         NaN            NaN                        NaN  275247.571429
2020-01-03          12.0                         NaN            NaN                        NaN  228315.428571
2020-01-04          12.0                         NaN            NaN                        NaN  181383.285714
2020-01-05          12.0                         NaN            NaN                        NaN  134451.142857
2020-01-06          12.0                     87519.0           True                    87519.0   87519.000000

Example final_df output, without the temporary columns:

            Country Code  Electric Consumption (MW)
2019-12-27          12.0              374340.000000
2019-12-28          12.0              372761.000000
2019-12-29          12.0              379411.000000
2019-12-30          12.0              416044.000000
2019-12-31          12.0              369111.857143
2020-01-01          12.0              322179.714286
2020-01-02          12.0              275247.571429
2020-01-03          12.0              228315.428571
2020-01-04          12.0              181383.285714
2020-01-05          12.0              134451.142857
2020-01-06          12.0               87519.000000
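
If you only need to report where the gaps are, without filling anything, the same diff idea can be applied per country in one pass. The sketch below is one possible way to do it, not part of the answer's code: `find_gaps` and the `gap_start`, `gap_end` and `gap_length` column names are hypothetical, and the sample frame is made up to mimic the outage described in the question.

```python
import pandas as pd

def find_gaps(df, country_col="Country Code", max_gap="3D"):
    """Return one row per gap longer than max_gap, for each country."""
    out = df[[country_col]].copy()
    out["gap_end"] = df.index  # timestamp of the first row after a gap
    out = out.sort_values([country_col, "gap_end"]).reset_index(drop=True)
    # previous timestamp within the same country (NaT for the first row)
    out["gap_start"] = out.groupby(country_col)["gap_end"].shift()
    out["gap_length"] = out["gap_end"] - out["gap_start"]
    gaps = out[out["gap_length"] > pd.Timedelta(max_gap)]
    cols = [country_col, "gap_start", "gap_end", "gap_length"]
    return gaps[cols].reset_index(drop=True)

# Made-up half-hourly data for two countries.
idx = pd.date_range("2019-03-01", "2019-04-10 23:30", freq="30min")
a = pd.DataFrame({"Country Code": 84, "Electric Consumption (MW)": 1.0}, index=idx)
b = pd.DataFrame({"Country Code": 12, "Electric Consumption (MW)": 2.0}, index=idx)
# Drop 13-31 March 2019 for country 84 to simulate the missing period.
a = a[(a.index < "2019-03-13") | (a.index >= "2019-04-01")]
df = pd.concat([a, b])

gaps = find_gaps(df)
print(gaps)
```

This only finds timestamps that are absent from the index. If the rows exist but hold NaN, drop them first (for example with `df.dropna(subset=["Electric Consumption (MW)"])`) so that they show up as gaps.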


python

time-series

missing-data

pandas