How to resample a dataframe ONLY when time range is too long?

Question

I have a simple DataFrame like this:

timestamp	Power
29/08/2021 02:30:16	155
29/08/2021 02:45:19	151
29/08/2021 03:00:14	155
29/08/2021 03:30:12	152
29/08/2021 04:00:12	149
29/08/2021 04:15:09	152
29/08/2021 04:30:16	153
29/08/2021 04:45:09	211
29/08/2021 05:30:19	77

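These measurements are supposed to be taken every 15 minutes, but for some reason a few of them were skipped. I want to add the missing timestamps wherever a measurement was skipped, with "NaN" as the value. I know this can be done with the function "resample", but it is important to use it only where it is needed. So what I need is to add a condition to that function: I only want to resample between rows that are (for example) more than 16 minutes apart. That way, wherever no resampling is needed, the timestamps stay exactly as they were, which is very important for my work. So what I would like to get is, more or less: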

timestamp	Power
29/08/2021 02:30:16	155
29/08/2021 02:45:19	151
29/08/2021 03:00:14	155
29/08/2021 03:15:00	NaN
29/08/2021 03:30:12	152
29/08/2021 03:45:00	NaN
29/08/2021 04:00:12	149
29/08/2021 04:15:09	152
29/08/2021 04:30:16	153
29/08/2021 04:45:09	211
29/08/2021 05:00:00	NaN
29/08/2021 05:15:00	NaN
29/08/2021 05:30:19	77
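
To illustrate the problem, here is a minimal sketch of what a plain resample does to this data. It assumes the table above is loaded into a DataFrame called df (a name chosen for this example) and uses a 15-minute grid:

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "timestamp": [
        "29/08/2021 02:30:16", "29/08/2021 02:45:19", "29/08/2021 03:00:14",
        "29/08/2021 03:30:12", "29/08/2021 04:00:12", "29/08/2021 04:15:09",
        "29/08/2021 04:30:16", "29/08/2021 04:45:09", "29/08/2021 05:30:19",
    ],
    "Power": [155, 151, 155, 152, 149, 152, 153, 211, 77],
})
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y %H:%M:%S")

# Upsample onto a fixed 15-minute grid.
plain = df.set_index("timestamp").resample("15min").asfreq()

# Check whether any measured timestamp survived, and whether any value did.
print(plain.index.isin(df["timestamp"]).any())
print(plain["Power"].isna().all())
```

None of the measured timestamps fall exactly on the grid, so the resampled frame keeps neither the original timestamps nor the measured values. That is why the resampling has to be restricted to the gaps.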

Answer 1

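Well, this was trickier than I expected, but I think I have solved it. Here is my solution:

I created a toy example of your df (please provide this code yourself next time, e.g. as described here)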

import pandas as pd
import datetime

df = pd.DataFrame()
df['timestamp'] = ['29/08/2021 02:30:16', '29/08/2021 02:45:19', '29/08/2021 03:00:14', '29/08/2021 03:30:12']
df['Power'] = [155,151,155,152]

The df looks like this:

   timestamp              Power
0  29/08/2021 02:30:16    155
1  29/08/2021 02:45:19    151
2  29/08/2021 03:00:14    155
3  29/08/2021 03:30:12    152

First we convert the timestamp column to pandas datetime objects and then set it as the index of the DataFrame.

# the strings are day-first, so state the format explicitly
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d/%m/%Y %H:%M:%S')
df.set_index('timestamp', inplace=True)

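This lets us use resample on it, but as you have already noticed, that creates a completely new date range instead of merging in your own. The way I worked around this is to resample each pair of consecutive timestamps separately. That way new entries are only added where there is "space" between two timestamps.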

# Resample each pair of consecutive rows on its own, so new rows appear
# only where two readings are more than 16 minutes apart.
timestamp_list = []
power_list = []
for i in range(len(df)):
    # origin='start' anchors the 16-minute bins at the first row of the pair
    temp_df = df.iloc[i:i+2].resample('16min', origin='start').asfreq()
    timestamp_list.extend(temp_df.index.to_list())
    power_list.extend(temp_df.Power.to_list())
final_df = pd.DataFrame({'timestamp': timestamp_list, 'Power': power_list})

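The result looks like this (with a 16-minute frequency anchored at the start of each pair, the inserted row falls 16 minutes after the preceding reading):

  timestamp            Power
0 2021-08-29 02:30:16  155.0
1 2021-08-29 02:45:19  151.0
2 2021-08-29 03:00:14  155.0
3 2021-08-29 03:16:14    NaN
4 2021-08-29 03:30:12  152.0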
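
Note that the loop above places each filler row relative to the preceding reading, not on the quarter-hour grid shown in the question. As an alternative, here is a minimal sketch that looks for long gaps with diff() and fills only those with grid timestamps. The names fill_long_gaps, max_gap and freq are hypothetical (they are not pandas API), and the sketch assumes the rows are already sorted by time and that each reading lies close to a grid point:

```python
import pandas as pd

def fill_long_gaps(df, max_gap="16min", freq="15min"):
    """Add NaN rows on the `freq` grid only between rows more than `max_gap` apart."""
    ts = df["timestamp"].reset_index(drop=True)
    # True at every row that ends a gap longer than max_gap.
    too_long = ts.diff() > pd.Timedelta(max_gap)
    missing = []
    for end_pos in too_long[too_long].index:
        start, end = ts[end_pos - 1], ts[end_pos]
        grid = pd.date_range(start.round(freq), end.round(freq), freq=freq)
        # Drop the two grid points that stand for the existing readings.
        missing.extend(grid[1:-1])
    filler = pd.DataFrame({"timestamp": pd.DatetimeIndex(missing)})
    # Columns absent from the filler rows (here Power) become NaN.
    out = pd.concat([df, filler], ignore_index=True)
    return out.sort_values("timestamp", ignore_index=True)

# Sample data copied from the question.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        [
            "29/08/2021 02:30:16", "29/08/2021 02:45:19", "29/08/2021 03:00:14",
            "29/08/2021 03:30:12", "29/08/2021 04:00:12", "29/08/2021 04:15:09",
            "29/08/2021 04:30:16", "29/08/2021 04:45:09", "29/08/2021 05:30:19",
        ],
        format="%d/%m/%Y %H:%M:%S",
    ),
    "Power": [155, 151, 155, 152, 149, 152, 153, 211, 77],
})
result = fill_long_gaps(df)
print(result)
```

The original timestamps are left untouched because the existing rows are never resampled; only the new rows are generated from the grid.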

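If you want to re-format the dates so that they look exactly as they did before, I suggest having a look at the datetime package. Or you can do it manually by iterating over the column.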
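
One possible way to do the re-formatting, sketched here with pandas' own Series.dt.strftime instead of the datetime package. The small final_df below is only a stand-in built for this example:

```python
import pandas as pd

# Stand-in for final_df, using three of the measured rows.
final_df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-08-29 02:45:19", "2021-08-29 03:00:14", "2021-08-29 03:30:12"]
    ),
    "Power": [151.0, 155.0, 152.0],
})

# Turn the datetime column back into day/month/year strings.
final_df["timestamp"] = final_df["timestamp"].dt.strftime("%d/%m/%Y %H:%M:%S")
print(final_df["timestamp"].tolist())
```

After this step the column holds plain strings, so it is best done last, once all time-based operations are finished.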

Answer 2

To reproduce your data I did:

import pandas as pd
data = pd.DataFrame.from_records(
    [
        ["29/08/2021 02:30:16", 155],
        ["29/08/2021 02:45:19", 151],
        ["29/08/2021 02:47:19", 152],
        ["29/08/2021 03:00:14", 155],
        ["29/08/2021 03:30:12", 152],
        ["29/08/2021 04:00:12", 149],
        ["29/08/2021 04:15:09", 152],
        ["29/08/2021 04:30:16", 153],
        ["29/08/2021 04:45:09", 211],
        ["29/08/2021 05:30:19", 77]
    ],
    columns=["timestamp", "Power"],
)
# the strings are day-first, so state the format explicitly
data["timestamp"] = pd.to_datetime(data["timestamp"], format="%d/%m/%Y %H:%M:%S")

To fill the gaps, I went through the following steps.

First, create a new column with the rounded timestamps and use it as the index:

data["t_rounded"] = data["timestamp"].dt.round("15min")
data.set_index("t_rounded", inplace=True, drop=True)
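
As a small illustration of what the rounding does, here is a sketch on three of the timestamps from the data above (the variable names stamps and rounded are chosen for this example only):

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(
    ["29/08/2021 02:45:19", "29/08/2021 02:47:19", "29/08/2021 03:00:14"],
    format="%d/%m/%Y %H:%M:%S",
))

# Round each timestamp to the nearest quarter hour.
rounded = stamps.dt.round("15min")
print(rounded.tolist())

# 02:45:19 and 02:47:19 both land on 02:45:00, so the second one is a duplicate.
print(rounded.duplicated().tolist())
```

Two readings that are close together end up with the same rounded timestamp, which is why duplicates have to be handled before reindexing.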

Make sure the index has no duplicates by setting the duplicated samples aside and keeping only the last sample for each rounded timestamp:

# flag every sample that shares its rounded timestamp with a later sample
is_duplicate = data.index.duplicated(keep='last')
# set the flagged samples aside so they can be added back later
duplicates_df = data[is_duplicate]

# remove the flagged samples from the original data
data = data[~is_duplicate]

Then, create the new, evenly spaced index that you want:

new_index = pd.date_range(data.index[0], data.index[-1], freq="15min")

Now use the new index for your DataFrame:

data = data.reindex(new_index)
data.reset_index(inplace=True)

Next, copy the rounded timestamp (now in a column called index, because of reset_index) into the rows whose timestamp is empty:

mask = data["timestamp"].isna()
data.loc[mask, "timestamp"] = data.loc[mask, "index"]

Finally, set the newly filled timestamps as the index and drop the rounded-time column:

data.set_index("timestamp", inplace=True, drop=True)
data.drop("index", inplace=True, axis=1)

If you want, you can add back the duplicated timestamps we removed earlier by doing:

df = duplicates_df.reset_index().set_index("timestamp", drop=True).drop("t_rounded", axis=1)
data = pd.concat([data, df]).sort_index()

The final result looks like this:

                     Power
timestamp                 
2021-08-29 02:30:16  155.0
2021-08-29 02:45:19  151.0
2021-08-29 02:47:19  152.0
2021-08-29 03:00:14  155.0
2021-08-29 03:15:00    NaN
2021-08-29 03:30:12  152.0
2021-08-29 03:45:00    NaN
2021-08-29 04:00:12  149.0
2021-08-29 04:15:09  152.0
2021-08-29 04:30:16  153.0
2021-08-29 04:45:09  211.0
2021-08-29 05:00:00    NaN
2021-08-29 05:15:00    NaN
2021-08-29 05:30:19   77.0
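
A result like this can be sanity-checked with a small helper. The following is only a sketch: check_filled is a hypothetical name, and the 16-minute limit is taken from the question rather than from this answer.

```python
import pandas as pd

def check_filled(original, filled, max_gap="16min"):
    """Return True if `filled` keeps every original timestamp and has no gap above `max_gap`."""
    keeps_all = original.isin(filled).all()
    gaps = pd.Series(filled).sort_values().diff().dropna()
    return bool(keeps_all and (gaps <= pd.Timedelta(max_gap)).all())

# Two measured timestamps with a 30-minute hole, and the same pair after filling.
original = pd.to_datetime(["2021-08-29 03:00:14", "2021-08-29 03:30:12"])
filled = pd.to_datetime(
    ["2021-08-29 03:00:14", "2021-08-29 03:15:00", "2021-08-29 03:30:12"]
)

print(check_filled(original, filled))
print(check_filled(original, original))
```

The first call passes because the filler row closes the hole. The second fails because the unfilled pair is almost 30 minutes apart.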
