Forward fill certain columns at a specified frequency for time series data
I would like to forward fill two columns, Time and X, in df:
Time X Y Z
0 2020-01-15 06:12:49.213 0 0 0
1 2020-01-15 08:12:49.213 1 2 2
2 2020-01-15 10:12:49.213 3 6 9
3 2020-01-15 12:12:49.213 12 15 4
4 2020-01-15 14:12:49.213 8 4 3
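but keep the remaining columns Y and Z unchanged, or fill their new rows with NaN.
I checked the pandas documentation for .fillna and .asfreq, but they didn't cover forward filling only certain columns. While this answer does, it doesn't specify a frequency.
Expected output (with a 10s frequency):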
Time X Y Z
0 2020-01-15 06:12:49.213 0 0 0
1 2020-01-15 06:12:59.213 0 NaN NaN # forward filled
2 2020-01-15 06:13:09.213 0 NaN NaN # forward filled
...
11 2020-01-15 08:12:49.213 1 2 2
12 2020-01-15 08:12:59.213 1 NaN NaN # forward filled
13 2020-01-15 08:13:09.213 1 NaN NaN # forward filled
...
22 2020-01-15 10:12:49.213 3 6 9
23 2020-01-15 10:12:59.213 3 NaN NaN # forward filled
...
You can try asfreq to resample.
Workflow:
- First, we set the Time column as the index
- Sort the index (otherwise the asfreq method will fail)
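The two preparation steps above can be sketched as follows. This is a minimal illustration on a made-up two-row frame (the name raw and its values are assumptions, not the question's data), deliberately given in the wrong order so the sort has something to do.

```python
import pandas as pd

# Hypothetical two-row input, intentionally out of chronological order
raw = pd.DataFrame({
    "Time": ["2020-01-15 08:12:49.213", "2020-01-15 06:12:49.213"],
    "X": [1, 0],
})

# Convert to datetime, move Time into the index, then sort the index
raw["Time"] = pd.to_datetime(raw["Time"])
indexed = raw.set_index("Time").sort_index()

# A sorted (monotonic) DatetimeIndex is what the later fill step relies on
print(indexed.index.is_monotonic_increasing)
```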
Now let's expand the dataframe. We resample twice, depending on the fill method used:
- If no method is provided (i.e. None), new values are filled with NaN. We use this for columns Y and Z.
- For column X, the ffill method "propagates last valid observation forward to next valid" (doc).
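To make the difference between the two calls concrete, here is a small sketch on a hypothetical frame (small, with made-up values 30 seconds apart). It only illustrates asfreq with and without method="ffill"; it is not the full solution.

```python
import pandas as pd

# Hypothetical input: two observations 30 seconds apart
idx = pd.to_datetime(["2020-01-15 06:00:00", "2020-01-15 06:00:30"])
small = pd.DataFrame({"X": [0, 1], "Y": [5, 7]}, index=idx)

# No method: the new 10 s rows are created and left as NaN
no_fill = small["Y"].asfreq("10s")

# method="ffill": the new rows carry the last valid observation forward
filled = small["X"].asfreq("10s", method="ffill")

print(no_fill)
print(filled)
```

Both results have four rows (06:00:00, :10, :20, :30); only the values in the two new rows differ.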
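As you highlighted in the comments, the frequency used determines whether all values are kept. If the frequency is too coarse, some values may not fall on the interval grid, and those values are skipped. One way around this is to use a smaller interval (say 1s). With it, ffill is applied correctly to all values.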
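However, if you really want a dataframe on a 10s date range, we need to resample again. Doing so once more drops the values that are not on the date range. That is not a problem, because we already have those values (they are our input), so we can append them back to the resampled dataframe; that way we are sure to have all the values. On pandas 2.0 and later this is done with pd.concat, since DataFrame.append was removed. Appending may create duplicates, so remove them with drop_duplicates.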
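The drop-then-restore step can be seen on a tiny example. The sketch below uses a hypothetical three-row frame in which one timestamp (06:00:14) does not fall on the 10 s grid. As a simplification it de-duplicates on the index with Index.duplicated rather than with drop_duplicates; that substitution is an assumption made here for brevity.

```python
import pandas as pd

# Hypothetical input: the middle timestamp is off the 10 s grid
idx = pd.to_datetime([
    "2020-01-15 06:00:00",
    "2020-01-15 06:00:14",
    "2020-01-15 06:00:30",
])
small = pd.DataFrame({"X": [0, 4, 1]}, index=idx)

# Resampling keeps only grid timestamps, so 06:00:14 disappears from the index
grid = small.asfreq("10s", method="ffill")
missing = small.index.difference(grid.index)

# Put the input rows back, keep the first copy of each timestamp, and re-sort
combined = pd.concat([grid, small])
restored = combined[~combined.index.duplicated()].sort_index()

print(missing)
print(restored)
```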
Full example:
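import pandas as pd

# Sample input: the question's rows plus two that fall off the 10 s grid
df = pd.DataFrame({
    "Time": ["2020-01-15 06:12:49.213", "2020-01-15 08:12:49.213",
             "2020-01-15 10:12:49.213", "2020-01-15 11:45:24.213",
             "2020-01-15 12:12:49.213", "2020-01-15 12:12:22.213",
             "2020-01-15 14:12:49.213"],
    "X": [0, 1, 3, 4, 12, 12, 8],
    "Y": [0, 2, 6, 6, 15, 15, 4],
    "Z": [0, 2, 9, 9, 4, 4, 3],
})

# Be sure it's a datetime object
df["Time"] = pd.to_datetime(df["Time"])
print(df)
#                      Time   X   Y  Z
# 0 2020-01-15 06:12:49.213   0   0  0
# 1 2020-01-15 08:12:49.213   1   2  2
# 2 2020-01-15 10:12:49.213   3   6  9
# 3 2020-01-15 11:45:24.213   4   6  9
# 4 2020-01-15 12:12:49.213  12  15  4
# 5 2020-01-15 12:12:22.213  12  15  4
# 6 2020-01-15 14:12:49.213   8   4  3

# Set the Time column as index and sort it
df = df.set_index("Time").sort_index()

# Resample: new rows get NaN in Y and Z, while X is forward filled
out = df[["Y", "Z"]].asfreq("10s")
out["X"] = df["X"].asfreq("1s", method="ffill").asfreq("10s")

# Append the input rows back, drop duplicates and reset the index
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
out = pd.concat([out, df], sort=True).reset_index().drop_duplicates().reset_index(drop=True)
print(out)
#                         Time   X     Y    Z
# 0    2020-01-15 06:12:49.213   0   0.0  0.0
# 1    2020-01-15 06:12:59.213   0   NaN  NaN
# 2    2020-01-15 06:13:09.213   0   NaN  NaN
# 3    2020-01-15 06:13:19.213   0   NaN  NaN
# 4    2020-01-15 06:13:29.213   0   NaN  NaN
# ...                      ...  ..   ...  ...
# 2878 2020-01-15 14:12:29.213  12   NaN  NaN
# 2879 2020-01-15 14:12:39.213  12   NaN  NaN
# 2880 2020-01-15 14:12:49.213   8   4.0  3.0
# 2881 2020-01-15 11:45:24.213   4   6.0  9.0
# 2882 2020-01-15 12:12:22.213  12  15.0  4.0
# [2883 rows x 4 columns]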