Pandas: resample data to the second, grouping by every ~10 seconds
Suppose I have the following DataFrame:
>>> df
a
2019-04-05 00:00:00 2.0
2019-04-05 00:00:01 1.0
2019-04-05 00:00:02 NaN
2019-04-05 00:00:03 NaN
2019-04-05 00:00:04 NaN
2019-04-05 00:00:05 NaN
2019-04-05 00:00:06 NaN
2019-04-05 00:00:07 NaN
2019-04-05 00:00:08 3.0
2019-04-05 00:00:09 4.0
2019-04-05 00:00:10 NaN
2019-04-05 00:00:11 NaN
2019-04-05 00:00:12 NaN
2019-04-05 00:00:13 NaN
2019-04-05 00:00:14 NaN
2019-04-05 00:00:15 NaN
2019-04-05 00:00:16 NaN
2019-04-05 00:00:17 NaN
2019-04-05 00:00:18 NaN
2019-04-05 00:00:19 NaN
2019-04-05 00:00:20 4.0
2019-04-05 00:00:21 5.0
2019-04-05 00:00:22 NaN
2019-04-05 00:00:23 NaN
2019-04-05 00:00:24 NaN
2019-04-05 00:00:25 NaN
2019-04-05 00:00:26 6.0
2019-04-05 00:00:27 NaN
2019-04-05 00:00:28 4.0
2019-04-05 00:00:29 NaN
2019-04-05 00:00:30 NaN
2019-04-05 00:00:31 NaN
I want 1 value every 7 seconds (assuming there is a value, otherwise just a NaN), so that the DataFrame looks like this:
>>> df
a
2019-04-05 00:00:00 2.0
2019-04-05 00:00:01 NaN
2019-04-05 00:00:02 NaN
2019-04-05 00:00:03 NaN
2019-04-05 00:00:04 NaN
2019-04-05 00:00:05 NaN
2019-04-05 00:00:06 NaN
2019-04-05 00:00:07 NaN
2019-04-05 00:00:08 3.0
2019-04-05 00:00:09 NaN
2019-04-05 00:00:10 NaN
2019-04-05 00:00:11 NaN
2019-04-05 00:00:12 NaN
2019-04-05 00:00:13 NaN
2019-04-05 00:00:14 NaN
2019-04-05 00:00:15 NaN
2019-04-05 00:00:16 NaN
2019-04-05 00:00:17 NaN
2019-04-05 00:00:18 NaN
2019-04-05 00:00:19 NaN
2019-04-05 00:00:20 4.0
2019-04-05 00:00:21 NaN
2019-04-05 00:00:22 NaN
2019-04-05 00:00:23 NaN
2019-04-05 00:00:24 NaN
2019-04-05 00:00:25 NaN
2019-04-05 00:00:26 NaN
2019-04-05 00:00:27 NaN
2019-04-05 00:00:28 4.0
2019-04-05 00:00:29 NaN
2019-04-05 00:00:30 NaN
2019-04-05 00:00:31 NaN
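The 7-second interval is arbitrary; in practice I take a value roughly every minute. This is what I have tried so far:
df = df.resample('7s').first()
But this produces the following DataFrame: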
a
2019-04-05 00:00:00 2.0
2019-04-05 00:00:07 3.0
2019-04-05 00:00:14 4.0
2019-04-05 00:00:21 5.0
2019-04-05 00:00:28 4.0
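Note: I am not bothered by the missing NaNs between these points, since they are implied. I am only unhappy with the timing: this forces one value into every 7-second bin, whereas I only want to forbid values that are within 7 seconds of each other, without requiring a value every 7 seconds.
Edit, for clarity:
The DataFrame I don't want: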
a
2019-04-05 00:00:00 2.0
2019-04-05 00:00:07 3.0
2019-04-05 00:00:14 4.0
2019-04-05 00:00:21 5.0
2019-04-05 00:00:28 4.0
The DataFrame I want:
>>> df
a
2019-04-05 00:00:00 2.0
2019-04-05 00:00:01 NaN
2019-04-05 00:00:02 NaN
2019-04-05 00:00:03 NaN
2019-04-05 00:00:04 NaN
2019-04-05 00:00:05 NaN
2019-04-05 00:00:06 NaN
2019-04-05 00:00:07 NaN
2019-04-05 00:00:08 3.0
2019-04-05 00:00:09 NaN
2019-04-05 00:00:10 NaN
2019-04-05 00:00:11 NaN
2019-04-05 00:00:12 NaN
2019-04-05 00:00:13 NaN
2019-04-05 00:00:14 NaN
2019-04-05 00:00:15 NaN
2019-04-05 00:00:16 NaN
2019-04-05 00:00:17 NaN
2019-04-05 00:00:18 NaN
2019-04-05 00:00:19 NaN
2019-04-05 00:00:20 4.0
2019-04-05 00:00:21 NaN
2019-04-05 00:00:22 NaN
2019-04-05 00:00:23 NaN
2019-04-05 00:00:24 NaN
2019-04-05 00:00:25 NaN
2019-04-05 00:00:26 NaN
2019-04-05 00:00:27 NaN
2019-04-05 00:00:28 4.0
2019-04-05 00:00:29 NaN
2019-04-05 00:00:30 NaN
2019-04-05 00:00:31 NaN
Or:
>>> df
a
2019-04-05 00:00:00 2.0
2019-04-05 00:00:08 3.0
2019-04-05 00:00:20 4.0
2019-04-05 00:00:28 4.0
You can upsample your DataFrame; you were already very close:
df = df.resample('7s').first()
df = df.resample(rule='1s').asfreq()
This creates a DataFrame in which the newly inserted one-second rows contain NaN.
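As a minimal sketch of that two-step resample, the sample data from the question can be rebuilt and run through it. The names `known`, `coarse` and `fine` are only illustrative, and `.asfreq()` is used here as one way to materialise the upsampled rows. Note that the values still sit on the 7-second bin edges (:00, :07, :14, ...), so this restores the NaN rows but does not change which timestamps carry values.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question: one row per second, mostly NaN.
index = pd.date_range("2019-04-05 00:00:00", periods=32, freq="1s")
known = {0: 2.0, 1: 1.0, 8: 3.0, 9: 4.0, 20: 4.0, 21: 5.0, 26: 6.0, 28: 4.0}
df = pd.DataFrame({"a": [known.get(i, np.nan) for i in range(32)]}, index=index)

# Downsample: first non-null value in each 7-second bin.
coarse = df.resample("7s").first()

# Upsample back to one row per second; the new rows are NaN.
fine = coarse.resample("1s").asfreq()
print(fine.dropna())
```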
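This is not strictly using pandas methods, but it gets the job done.
import numpy as np

c = 8  # rows since the last kept value; starts above 7 so the first value is kept
for r in range(len(df)):
    c += 1
    if c > 7 and not np.isnan(df.iat[r, 0]):
        c = 0  # keep this value and restart the counter
    else:
        df.iat[r, 0] = np.nan  # too close to the last kept value
        # (df.iat writes to df itself; assigning to a row from iterrows() may only change a copy)
Once applied to df, this leaves the desired DataFrame.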
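Edit:
For a DataFrame with n columns, at most one value per x rows:
x = 7  # minimum number of rows between kept values (set as needed)
c = [x + 1] * df.shape[1]  # one counter per column
for r in range(len(df)):
    c = [n + 1 for n in c]
    for i in range(len(c)):
        if c[i] > x and not np.isnan(df.iat[r, i]):
            c[i] = 0
        else:
            df.iat[r, i] = np.nan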
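Second edit:
The above assumes there is a row (possibly NaN) for every timestamp. The following also works when the DataFrame has gaps:
import datetime as dt

c = [dt.datetime(1, 1, 1)] * df.shape[1]  # time of the last kept value, per column
for r, index in enumerate(df.index):
    for i in range(len(c)):
        if index.to_pydatetime() - c[i] > dt.timedelta(seconds=x) and not np.isnan(df.iat[r, i]):
            c[i] = index.to_pydatetime()
        else:
            df.iat[r, i] = np.nan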
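The timestamp-based loop above can also be wrapped in a small function that leaves the input untouched. This is only a sketch under stated assumptions: `keep_spaced` and `min_gap` are hypothetical names, it handles a single column (a Series), and it uses the same rule as the loop, namely that a value is kept only if it lies strictly more than `min_gap` after the last kept value.

```python
import numpy as np
import pandas as pd

def keep_spaced(series, min_gap):
    """Return a copy of series where values too close to the last kept one are NaN."""
    kept = []
    last_kept = None  # timestamp of the most recently kept value
    for ts, value in series.items():
        keep = pd.notna(value) and (last_kept is None or ts - last_kept > min_gap)
        kept.append(value if keep else np.nan)
        if keep:
            last_kept = ts
    return pd.Series(kept, index=series.index, name=series.name)

# Sample data from the question: one row per second, mostly NaN.
index = pd.date_range("2019-04-05 00:00:00", periods=32, freq="1s")
known = {0: 2.0, 1: 1.0, 8: 3.0, 9: 4.0, 20: 4.0, 21: 5.0, 26: 6.0, 28: 4.0}
df = pd.DataFrame({"a": [known.get(i, np.nan) for i in range(32)]}, index=index)

result = keep_spaced(df["a"], pd.Timedelta(seconds=7))
print(result.dropna())  # values remain at :00, :08, :20 and :28
```

Because the comparison is on timestamps rather than row counts, the same call should also behave sensibly when some seconds are missing from the index.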
How about filling the NA values before resampling?
df = df.fillna('something').resample('7s').first()
Then the values are not forced:
a
2019-04-05 00:00:00 2
2019-04-05 00:00:07 something
2019-04-05 00:00:14 something
2019-04-05 00:00:21 5
2019-04-05 00:00:28 4
Note that if you fill NA with a string such as something, the whole column is converted to object instead of float. So if you want to keep the dtype, you can use df.fillna(0) instead.
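A short sketch of the `fillna(0)` variant on the sample data from the question, to show what it returns. The frame is rebuilt here so the snippet runs on its own; `filled` is just an illustrative name. One caveat worth keeping in mind: with 0 as the placeholder, a bin whose first row was missing can no longer be told apart from a bin whose first row really was 0.

```python
import numpy as np
import pandas as pd

# Sample data from the question: one row per second, mostly NaN.
index = pd.date_range("2019-04-05 00:00:00", periods=32, freq="1s")
known = {0: 2.0, 1: 1.0, 8: 3.0, 9: 4.0, 20: 4.0, 21: 5.0, 26: 6.0, 28: 4.0}
df = pd.DataFrame({"a": [known.get(i, np.nan) for i in range(32)]}, index=index)

# After filling, first() returns whatever sits on the first row of each bin,
# so bins that start on a missing row yield the placeholder 0.
filled = df.fillna(0).resample("7s").first()
print(filled["a"].tolist())  # [2.0, 0.0, 0.0, 5.0, 4.0]
print(filled["a"].dtype)     # float64
```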
df.loc[df.resample("7s").apply(lambda s: s.first_valid_index()).a]
If you want to fill the intermediate values with NaN, then:
df1 = df.loc[df.resample("7s").apply(lambda s: s.first_valid_index()).a]
df1.resample("1s").apply(lambda s: None if s.empty else s)
Edit:
Following the clarification, here we go:
df[df.rolling(window="7s", closed='neither').sum().isna()]
Use the upsampling code shown above to fill the gaps with NaN.
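To see what the rolling mask keeps, it can be run on the sample data from the question (rebuilt below so the snippet stands alone; `mask` and `masked` are illustrative names). With `closed='neither'` the window for a row at time t covers the rows strictly between t - 7s and t, so a value survives only if no non-NaN value occurs in the six preceding seconds, whether or not that earlier value was itself kept. On this data that appears to drop the value at :28, because the 6.0 at :26 falls inside its window even though the 6.0 is discarded too.

```python
import numpy as np
import pandas as pd

# Sample data from the question: one row per second, mostly NaN.
index = pd.date_range("2019-04-05 00:00:00", periods=32, freq="1s")
known = {0: 2.0, 1: 1.0, 8: 3.0, 9: 4.0, 20: 4.0, 21: 5.0, 26: 6.0, 28: 4.0}
df = pd.DataFrame({"a": [known.get(i, np.nan) for i in range(32)]}, index=index)

# True where the look-back window contains no non-NaN value at all.
mask = df.rolling(window="7s", closed="neither").sum().isna()

# Indexing with a boolean frame keeps every row and sets masked-out cells to NaN.
masked = df[mask]
print(masked.dropna())  # values remain at :00, :08 and :20
```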
Edit 2:
We have to loop over the rows, because the decision to emit a value depends on the previously emitted value:
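import pandas as pd

def f():
    skip = 0  # rows still to skip after the last emitted value
    for row in df.itertuples():
        if skip == 0:
            if pd.notna(row.a):
                yield row
                skip = 7
        else:
            skip = skip - 1

# The timestamps end up in a column named "Index".
pd.DataFrame(f())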