How should I handle duplicate times in time series data with pandas?
As part of a larger data set, I have the following returned from an API call:
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'}
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'}
Ideally, I would use the timestamps as the index of a pandas DataFrame, but this seems to fail because there are duplicates when converting to JSON:
new_df = df.set_index(pd.to_datetime(df['Time']))
print(new_df.to_json(orient='index'))
ValueError: DataFrame index must be unique for orient='index'.
Any guidance on the best way to handle this situation? Throw away one of the data points? The times are no finer than one second, and there was clearly a price change within that second.
I think you can change the duplicated datetimes by adding milliseconds generated by cumcount and to_timedelta:
import datetime

import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01
1 0.052500 2017-05-21 18:18:01
print (pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms'))
0 00:00:00
1 00:00:00.001000
dtype: timedelta64[ns]
df['Time'] = df['Time'] + pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms')
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01.000
1 0.052500 2017-05-21 18:18:01.001
new_df = df.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052600"},"1495390681001":{"Price":"0.052500"}}
You can use .duplicated to keep either the first or the last entry. Have a look at pandas.DataFrame.duplicated.
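A minimal sketch of that approach, assuming df is the original two-row frame from the question and you want to keep only the last price observed within each second (keep='first' would keep the earliest instead):
# Drop all but the last row for each distinct Time value
deduped = df.drop_duplicates(subset='Time', keep='last')
new_df = deduped.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052500"}}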
Just to expand on this: adding a loop helps deal with any new duplicates introduced by the first pass. The isnull check is important for catching any NaT values in the data, because any timedelta + NaT is still NaT.
import logging

import pandas

LOGGER = logging.getLogger(__name__)

def deduplicate_start_times(frame, col='start_time', max_iterations=10):
    """
    Fuzz duplicate start times from a frame so we can stack and unstack
    this frame.
    """
    for _ in range(max_iterations):
        # Flag repeated times, but ignore NaT rows: they cannot be fixed
        # by fuzzing, since timedelta + NaT is still NaT.
        dups = frame.duplicated(subset=col) & ~pandas.isnull(frame[col])
        if not dups.any():
            break
        LOGGER.debug("Removing %i duplicates", dups.sum())
        # Shift the n-th occurrence of each time forward by n milliseconds
        frame[col] += pandas.to_timedelta(frame.groupby(col).cumcount(),
                                          unit='ms')
    else:
        LOGGER.error("Exceeded max iterations removing duplicates. "
                     "%i duplicates remain", dups.sum())
    return frame
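A hypothetical usage example (the sample frame is my own, for illustration): the two rows sharing a start_time get fuzzed one millisecond apart, while the NaT row is left untouched:
frame = pandas.DataFrame({
    'start_time': [pandas.Timestamp('2017-05-21 18:18:01'),
                   pandas.Timestamp('2017-05-21 18:18:01'),
                   pandas.NaT],
    'price': [0.0526, 0.0525, 0.0524],
})
print(deduplicate_start_times(frame))
               start_time   price
0 2017-05-21 18:18:01.000  0.0526
1 2017-05-21 18:18:01.001  0.0525
2                     NaT  0.0524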