How should I handle duplicate times in time series data with pandas?
As part of a larger data set, I have the following returned from an API call:
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'}
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'}
Ideally, I would use the timestamps as the index of a pandas DataFrame, but this seems to fail because there are duplicates when converting to JSON:
new_df = df.set_index(pd.to_datetime(df['Time']))
print(new_df.to_json(orient='index'))
ValueError: DataFrame index must be unique for orient='index'.
Any guidance on the best way to handle this situation? Throw away one of the data points? The times are no finer than one second, and there was clearly a price change within that second.
I think you can change the duplicated datetimes by adding milliseconds generated by cumcount and to_timedelta:
import datetime

import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01
1 0.052500 2017-05-21 18:18:01
print (pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms'))
0 00:00:00
1 00:00:00.001000
dtype: timedelta64[ns]
df['Time'] = df['Time'] + pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms')
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01.000
1 0.052500 2017-05-21 18:18:01.001
new_df = df.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052600"},"1495390681001":{"Price":"0.052500"}}
You can use .duplicated to keep either the first or the last entry. Have a look at pandas.DataFrame.duplicated.
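A minimal sketch of that approach, assuming df is the original two-row frame from the question and you want to keep only the last price observed within each second (keep='first' would keep the earliest instead):
# Drop all but the last row for each distinct Time value
deduped = df.drop_duplicates(subset='Time', keep='last')
new_df = deduped.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052500"}}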
Just to expand on this: adding a loop helps deal with any new duplicates introduced by the first pass. The isnull check is important for catching any NaT values in the data, because any timedelta + NaT is still NaT.
import logging

import pandas

LOGGER = logging.getLogger(__name__)

def deduplicate_start_times(frame, col='start_time', max_iterations=10):
    """
    Fuzz duplicate start times from a frame so we can stack and unstack
    this frame.
    """
    for _ in range(max_iterations):
        # Flag repeated times, but ignore NaT rows: they cannot be fixed
        # by fuzzing, since timedelta + NaT is still NaT.
        dups = frame.duplicated(subset=col) & ~pandas.isnull(frame[col])
        if not dups.any():
            break
        LOGGER.debug("Removing %i duplicates", dups.sum())
        # Shift the n-th occurrence of each time forward by n milliseconds
        frame[col] += pandas.to_timedelta(frame.groupby(col).cumcount(),
                                          unit='ms')
    else:
        LOGGER.error("Exceeded max iterations removing duplicates. "
                     "%i duplicates remain", dups.sum())
    return frame
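A hypothetical usage example (the sample frame is my own, for illustration): the two rows sharing a start_time get fuzzed one millisecond apart, while the NaT row is left untouched:
frame = pandas.DataFrame({
    'start_time': [pandas.Timestamp('2017-05-21 18:18:01'),
                   pandas.Timestamp('2017-05-21 18:18:01'),
                   pandas.NaT],
    'price': [0.0526, 0.0525, 0.0524],
})
print(deduplicate_start_times(frame))
               start_time   price
0 2017-05-21 18:18:01.000  0.0526
1 2017-05-21 18:18:01.001  0.0525
2                     NaT  0.0524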