pandas interpolate with nearest neighbor returns new values

Question

I want to upsample this data using nearest-neighbor interpolation.

file.csv

ProcessStepId,_time
0,2019-03-14 01:35:59.769
0,2019-03-14 01:37:59.076
0,2019-03-14 01:39:59.723
0,2019-03-14 01:42:00.145
1,2019-03-14 01:42:04.478
1,2019-03-14 01:43:59.818
1,2019-03-14 01:45:59.776
1,2019-03-14 01:47:59.802

My approach so far: read the CSV file into a DataFrame and convert it to a DataFrame with a DatetimeIndex. Then upsample it and interpolate using the nearest neighbor:

import pandas as pd

df = pd.read_csv('file.csv')
# Parse the timestamps and drop the fractional seconds
df['_time'] = pd.to_datetime(df['_time']).dt.floor('s')
df.set_index('_time', inplace=True)

# Now upsample
df = df.resample('10s').mean()
# method='nearest' requires SciPy
df.interpolate(method='nearest', inplace=True)

My output looks like this:

_time,    ProcessStepId
2019-03-14 01:35:50, 0.0
2019-03-14 01:36:00, 0.0
2019-03-14 01:36:10, 0.0
2019-03-14 01:36:20, 0.0
2019-03-14 01:36:30, 0.0
2019-03-14 01:36:40, 0.0
2019-03-14 01:36:50, 0.0
2019-03-14 01:37:00, 0.0
2019-03-14 01:37:10, 0.0
2019-03-14 01:37:20, 0.0
2019-03-14 01:37:30, 0.0
2019-03-14 01:37:40, 0.0
2019-03-14 01:37:50, 0.0
2019-03-14 01:38:00, 0.0
2019-03-14 01:38:10, 0.0
2019-03-14 01:38:20, 0.0
2019-03-14 01:38:30, 0.0
2019-03-14 01:38:40, 0.0
2019-03-14 01:38:50, 0.0
2019-03-14 01:39:00, 0.0
2019-03-14 01:39:10, 0.0
2019-03-14 01:39:20, 0.0
2019-03-14 01:39:30, 0.0
2019-03-14 01:39:40, 0.0
2019-03-14 01:39:50, 0.0
2019-03-14 01:40:00, 0.0
2019-03-14 01:40:10, 0.0
2019-03-14 01:40:20, 0.0
2019-03-14 01:40:30, 0.0
2019-03-14 01:40:40, 0.0
2019-03-14 01:40:50, 0.0
2019-03-14 01:41:00, 0.5
2019-03-14 01:41:10, 0.5
2019-03-14 01:41:20, 0.5
2019-03-14 01:41:30, 0.5
2019-03-14 01:41:40, 0.5
2019-03-14 01:41:50, 0.5
2019-03-14 01:42:00, 0.5
2019-03-14 01:42:10, 0.5
2019-03-14 01:42:20, 0.5
2019-03-14 01:42:30, 0.5
2019-03-14 01:42:40, 0.5
2019-03-14 01:42:50, 0.5
2019-03-14 01:43:00, 1.0
2019-03-14 01:43:10, 1.0
2019-03-14 01:43:20, 1.0
2019-03-14 01:43:30, 1.0
2019-03-14 01:43:40, 1.0
2019-03-14 01:43:50, 1.0
2019-03-14 01:44:00, 1.0
2019-03-14 01:44:10, 1.0
2019-03-14 01:44:20, 1.0
2019-03-14 01:44:30, 1.0
2019-03-14 01:44:40, 1.0
2019-03-14 01:44:50, 1.0
2019-03-14 01:45:00, 1.0
2019-03-14 01:45:10, 1.0
2019-03-14 01:45:20, 1.0
2019-03-14 01:45:30, 1.0
2019-03-14 01:45:40, 1.0
2019-03-14 01:45:50, 1.0
2019-03-14 01:46:00, 1.0
2019-03-14 01:46:10, 1.0
2019-03-14 01:46:20, 1.0
2019-03-14 01:46:30, 1.0
2019-03-14 01:46:40, 1.0
2019-03-14 01:46:50, 1.0
2019-03-14 01:47:00, 1.0
2019-03-14 01:47:10, 1.0
2019-03-14 01:47:20, 1.0
2019-03-14 01:47:30, 1.0
2019-03-14 01:47:40, 1.0
2019-03-14 01:47:50, 1.0

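I expect every ProcessStepId value to be either 1 or 0 (ideally an integer), but some rows are assigned the value 0.5, which is invalid for my use case. I would also expect every value after 2019-03-14 01:42:04.478 to be exactly 1, but that is not the case here.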

Am I missing something about how nearest neighbor works?

Answer 1

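The 0.5 values are created by df.resample(...).mean(): the 10-second bin starting at 01:42:00 contains both a 0 (01:42:00.145) and a 1 (01:42:04.478), so their mean is 0.5, and interpolate then spreads that value to the neighboring empty bins. Use nearest when resampling instead; it fills each new timestamp with the value of its nearest neighbor in the original series: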

df = df.resample('10s').nearest()
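
To make the difference concrete, here is a minimal sketch that runs both variants on the sample data from the question. It reads the CSV from an in-memory string rather than from file.csv, and the names CSV, binned and upsampled are chosen here purely for illustration.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained
CSV = """ProcessStepId,_time
0,2019-03-14 01:35:59.769
0,2019-03-14 01:37:59.076
0,2019-03-14 01:39:59.723
0,2019-03-14 01:42:00.145
1,2019-03-14 01:42:04.478
1,2019-03-14 01:43:59.818
1,2019-03-14 01:45:59.776
1,2019-03-14 01:47:59.802
"""

df = pd.read_csv(io.StringIO(CSV), parse_dates=["_time"]).set_index("_time")

# mean() aggregates all rows that fall into the same 10-second bin,
# so the bin holding both a 0 and a 1 becomes 0.5
binned = df.resample("10s").mean()
print(binned.loc[pd.Timestamp("2019-03-14 01:42:00")])

# nearest() does not aggregate: each grid timestamp takes the value
# of the closest original row, so only 0 and 1 can appear
upsampled = df.resample("10s").nearest()
print(upsampled["ProcessStepId"].unique())
```

With nearest, the grid point 01:42:00 takes the value 0 (the closest row is 01:42:00.145) and every grid point from 01:42:10 onward takes the value 1, which appears to be what the question asks for.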
