如何长时间检测时间序列中的缺失值
How to detect missing values in a time series for long periods
我有2016-2019年的用电量数据。数据每 30 分钟记录一次,持续 4 年。 13/03/2019 - 31/03/209之间没有数据。
- 我想问一下如何在没有可视化的情况下通过编码检测到这种缺失,因为我有 12 个国家,它们在其他月份可能有这样的缺失值,而且它们是不可见的。 (检测是否连续超过3天未命中)。感谢您的帮助!
这是数据:
Country Code Electric Consumption (MW)
Date (index)
2016-01-01 84 354642.0
2016-01-02 84 376207.0
2016-01-03 84 381534.0
2016-01-04 84 435561.0
2016-01-05 84 447820.0
... ... ...
2019-12-27 12 374340.0
2019-12-28 12 372761.0
2019-12-29 12 379411.0
2019-12-30 12 416044.0
2019-12-31 12 87519.0
在这种情况下,您是最好的判断者,您必须根据您尝试做的事情的最佳结果来做出决定。顺便说一下,您正在尝试填充缺失值而不是识别它们。
DataFrame.interpolate()
函数中有一些选项,您可以在这里找到。
还有其他选项,您实际上不需要对数据做任何事情,您只需从相邻行复制值,例如 DataFrame.ffill()
和 DataFrame.bfill()
。
这是一种识别 3 天间隔并填补它们的方法。请注意,这适用于每个唯一的国家/地区代码。您可以使用列表保存 final_dfs 并在需要将它们重新组合在一起时使用 pd.concat()。
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country Code': {'2016-01-01': 84,
'2016-01-02': 84,
'2016-01-03': 84,
'2016-01-04': 84,
'2016-01-05': 84,
'2019-12-27': 12,
'2019-12-28': 12,
'2019-12-29': 12,
'2019-12-30': 12,
'2019-12-31': 12},
'Electric Consumption (MW)': {'2016-01-01': 354642.0,
'2016-01-02': 376207.0,
'2016-01-03': 381534.0,
'2016-01-04': 435561.0,
'2016-01-05': 447820.0,
'2019-12-27': 374340.0,
'2019-12-28': 372761.0,
'2019-12-29': 379411.0,
'2019-12-30': 416044.0,
'2019-12-31': 87519.0}})
# change index value for fake gap
df.index = ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2019-12-27', '2019-12-28',
'2019-12-29', '2019-12-30', '2020-01-06']
#convert object dates to datetime
df.index = pd.to_datetime(df.index)
for g in df['Country Code'].unique():
# look at each unique country
country_slice = df.loc[df['Country Code'] == g]
# use timedelta to identify 3 day gaps
country_slice['Three Day Gap'] = country_slice.index.to_series().diff() > pd.Timedelta('3d')
# create a new index with previous min and max
idx = pd.date_range(country_slice.index.min(), country_slice.index.max())
s = country_slice['Electric Consumption (MW)']
s.index = pd.DatetimeIndex(s.index)
# this gives us a series with new rows and nans for the missing dates
s = s.reindex(idx, fill_value=np.nan)
# join the old data back to nex index
country_slice_join = country_slice.join(s, how='outer', lsuffix='L')
# now we can interpolate as missing dates are new rows
country_slice_join['interpolate'] = country_slice_join['Electric Consumption (MW)'].interpolate(method='linear', axis=0)
country_slice_join['Country Code'] = country_slice_join['Country Code'].ffill()
# remove temp columns
final_df = country_slice_join[['Country Code', 'interpolate']]
final_df.columns = ['Country Code', 'Electric Consumption (MW)']
示例 country_slice_join 在 final_df 之前输出:
Country Code Electric Consumption (MW)L Three Day Gap Electric Consumption (MW) interpolate
2019-12-27 12.0 374340.0 False 374340.0 374340.000000
2019-12-28 12.0 372761.0 False 372761.0 372761.000000
2019-12-29 12.0 379411.0 False 379411.0 379411.000000
2019-12-30 12.0 416044.0 False 416044.0 416044.000000
2019-12-31 12.0 NaN NaN NaN 369111.857143
2020-01-01 12.0 NaN NaN NaN 322179.714286
2020-01-02 12.0 NaN NaN NaN 275247.571429
2020-01-03 12.0 NaN NaN NaN 228315.428571
2020-01-04 12.0 NaN NaN NaN 181383.285714
2020-01-05 12.0 NaN NaN NaN 134451.142857
2020-01-06 12.0 87519.0 True 87519.0 87519.000000
示例 final_df 没有临时列的输出:
Country Code Electric Consumption (MW)
2019-12-27 12.0 374340.000000
2019-12-28 12.0 372761.000000
2019-12-29 12.0 379411.000000
2019-12-30 12.0 416044.000000
2019-12-31 12.0 369111.857143
2020-01-01 12.0 322179.714286
2020-01-02 12.0 275247.571429
2020-01-03 12.0 228315.428571
2020-01-04 12.0 181383.285714
2020-01-05 12.0 134451.142857
2020-01-06 12.0 87519.000000
我有2016-2019年的用电量数据。数据每 30 分钟记录一次,持续 4 年。 13/03/2019 - 31/03/209之间没有数据。
- 我想问一下如何在没有可视化的情况下通过编码检测到这种缺失,因为我有 12 个国家,它们在其他月份可能有这样的缺失值,而且它们是不可见的。 (检测是否连续超过3天未命中)。感谢您的帮助!
这是数据:
Country Code Electric Consumption (MW)
Date (index)
2016-01-01 84 354642.0
2016-01-02 84 376207.0
2016-01-03 84 381534.0
2016-01-04 84 435561.0
2016-01-05 84 447820.0
... ... ...
2019-12-27 12 374340.0
2019-12-28 12 372761.0
2019-12-29 12 379411.0
2019-12-30 12 416044.0
2019-12-31 12 87519.0
在这种情况下,您是最好的判断者,您必须根据您尝试做的事情的最佳结果来做出决定。顺便说一下,您正在尝试填充缺失值而不是识别它们。
DataFrame.interpolate()
函数中有一些选项,您可以在这里找到。
还有其他选项,您实际上不需要对数据做任何事情,您只需从相邻行复制值,例如 DataFrame.ffill()
和 DataFrame.bfill()
。
这是一种识别 3 天间隔并填补它们的方法。请注意,这适用于每个唯一的国家/地区代码。您可以使用列表保存 final_dfs 并在需要将它们重新组合在一起时使用 pd.concat()。
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country Code': {'2016-01-01': 84,
'2016-01-02': 84,
'2016-01-03': 84,
'2016-01-04': 84,
'2016-01-05': 84,
'2019-12-27': 12,
'2019-12-28': 12,
'2019-12-29': 12,
'2019-12-30': 12,
'2019-12-31': 12},
'Electric Consumption (MW)': {'2016-01-01': 354642.0,
'2016-01-02': 376207.0,
'2016-01-03': 381534.0,
'2016-01-04': 435561.0,
'2016-01-05': 447820.0,
'2019-12-27': 374340.0,
'2019-12-28': 372761.0,
'2019-12-29': 379411.0,
'2019-12-30': 416044.0,
'2019-12-31': 87519.0}})
# change index value for fake gap
df.index = ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2019-12-27', '2019-12-28',
'2019-12-29', '2019-12-30', '2020-01-06']
#convert object dates to datetime
df.index = pd.to_datetime(df.index)
for g in df['Country Code'].unique():
# look at each unique country
country_slice = df.loc[df['Country Code'] == g]
# use timedelta to identify 3 day gaps
country_slice['Three Day Gap'] = country_slice.index.to_series().diff() > pd.Timedelta('3d')
# create a new index with previous min and max
idx = pd.date_range(country_slice.index.min(), country_slice.index.max())
s = country_slice['Electric Consumption (MW)']
s.index = pd.DatetimeIndex(s.index)
# this gives us a series with new rows and nans for the missing dates
s = s.reindex(idx, fill_value=np.nan)
# join the old data back to nex index
country_slice_join = country_slice.join(s, how='outer', lsuffix='L')
# now we can interpolate as missing dates are new rows
country_slice_join['interpolate'] = country_slice_join['Electric Consumption (MW)'].interpolate(method='linear', axis=0)
country_slice_join['Country Code'] = country_slice_join['Country Code'].ffill()
# remove temp columns
final_df = country_slice_join[['Country Code', 'interpolate']]
final_df.columns = ['Country Code', 'Electric Consumption (MW)']
示例 country_slice_join 在 final_df 之前输出:
Country Code Electric Consumption (MW)L Three Day Gap Electric Consumption (MW) interpolate
2019-12-27 12.0 374340.0 False 374340.0 374340.000000
2019-12-28 12.0 372761.0 False 372761.0 372761.000000
2019-12-29 12.0 379411.0 False 379411.0 379411.000000
2019-12-30 12.0 416044.0 False 416044.0 416044.000000
2019-12-31 12.0 NaN NaN NaN 369111.857143
2020-01-01 12.0 NaN NaN NaN 322179.714286
2020-01-02 12.0 NaN NaN NaN 275247.571429
2020-01-03 12.0 NaN NaN NaN 228315.428571
2020-01-04 12.0 NaN NaN NaN 181383.285714
2020-01-05 12.0 NaN NaN NaN 134451.142857
2020-01-06 12.0 87519.0 True 87519.0 87519.000000
示例 final_df 没有临时列的输出:
Country Code Electric Consumption (MW)
2019-12-27 12.0 374340.000000
2019-12-28 12.0 372761.000000
2019-12-29 12.0 379411.000000
2019-12-30 12.0 416044.000000
2019-12-31 12.0 369111.857143
2020-01-01 12.0 322179.714286
2020-01-02 12.0 275247.571429
2020-01-03 12.0 228315.428571
2020-01-04 12.0 181383.285714
2020-01-05 12.0 134451.142857
2020-01-06 12.0 87519.000000