Pandas - conditional column with sliding window
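I have a df with two columns: a timestamp and a text. I am trying to label the data with True/False (1/0) labels. The condition is that if the word "error" appears in the text, all entries in the 3-4 hours before that entry should get the label 1, and all others should get 0. For example, a df like this: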
time text
15:00 a-ok
16:01 fine
17:00 kay
18:00 uhum
19:00 doin well
20:00 is error
20:05 still error
21:00 fine again
should be transformed into:
time text error coming
15:00 a-ok 0
16:01 fine 1
17:00 kay 1
18:00 uhum 1
19:00 doin well 1
20:00 is error 0
20:05 still error 0
21:00 fine again 0
I have read a bit about sliding windows and .rolling, but I am having trouble putting it all together.
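The idea is to convert the times to timedeltas, select the timedeltas of the rows that contain an error, and build one mask per error time, combining them with logical_or.reduce. That combined mask is then chained with the inverted mask m1 so the error rows themselves are excluded, and the resulting True/False values are converted to the integers 1/0: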
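import numpy as np
import pandas as pd

# 'HH:MM' strings -> timedeltas since midnight
td = pd.to_timedelta(df['time'].astype(str) + ':00')
# True for rows whose text mentions an error
m1 = df['text'].str.contains('error')
# times of the error rows
v = td[m1]
print (v)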
5 20:00:00
6 20:05:00
Name: time, dtype: timedelta64[ns]
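# one mask per error time (rows within the 4 hours up to that error), OR-ed together
m2 = np.logical_or.reduce([td.between(x - pd.Timedelta(4, unit='h'), x) for x in v])
# drop the error rows themselves and map True/False to 1/0
df['error coming'] = (m2 & ~m1).astype(int)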
print (df)
time text error coming
0 15:00 a-ok 0
1 16:01 fine 1
2 17:00 kay 1
3 18:00 uhum 1
4 19:00 doin well 1
5 20:00 is error 0
6 20:05 still error 0
7 21:00 fine again 0
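For reference, the steps above can be put together into one self-contained script. This is only a sketch: the sample frame is rebuilt from the question's table, the names `is_error`, `window`, `masks` and `in_window` are made up for illustration, and the fallback for a frame without any error rows is an addition that the original answer does not cover.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "time": ["15:00", "16:01", "17:00", "18:00", "19:00", "20:00", "20:05", "21:00"],
    "text": ["a-ok", "fine", "kay", "uhum", "doin well",
             "is error", "still error", "fine again"],
})

# "HH:MM" strings -> timedeltas since midnight
td = pd.to_timedelta(df["time"] + ":00")
is_error = df["text"].str.contains("error")

window = pd.Timedelta(hours=4)
# One boolean mask per error row: True where a row lies in [error - 4h, error]
masks = [td.between(t - window, t) for t in td[is_error]]
# OR the masks together; assumed fallback: all False when there is no error row
in_window = np.logical_or.reduce(masks) if masks else np.zeros(len(df), dtype=bool)

# Exclude the error rows themselves and convert True/False to 1/0
df["error coming"] = (~is_error & in_window).astype(int)
print(df)
```

Note that `between` includes both endpoints by default, so a row exactly 4 hours before an error would also be labelled 1.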
EDIT:
df['time'] = pd.to_datetime(df['time'])
print (df)
time text
0 2019-01-26 15:00:00 a-ok
1 2019-01-26 16:01:00 fine
2 2019-01-26 17:00:00 kay
3 2019-01-26 18:00:00 uhum
4 2019-01-26 19:00:00 doin well
5 2019-01-26 20:00:00 is error
6 2019-01-26 20:05:00 still error
7 2019-01-26 21:00:00 fine again
print (df.dtypes)
time datetime64[ns]
text object
dtype: object
m1 = df['text'].str.contains('error')
v = df.loc[m1, 'time']
print (v)
5 2019-01-26 20:00:00
6 2019-01-26 20:05:00
Name: time, dtype: datetime64[ns]
m2 = np.logical_or.reduce([df['time'].between(x - pd.Timedelta(4, unit='h'), x) for x in v])
df['error coming'] = (m2 & ~m1).astype(int)
print (df)
time text error coming
0 2019-01-26 15:00:00 a-ok 0
1 2019-01-26 16:01:00 fine 1
2 2019-01-26 17:00:00 kay 1
3 2019-01-26 18:00:00 uhum 1
4 2019-01-26 19:00:00 doin well 1
5 2019-01-26 20:00:00 is error 0
6 2019-01-26 20:05:00 still error 0
7 2019-01-26 21:00:00 fine again 0
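One practical difference of the datetime version is that a window can cross midnight, which timedeltas measured from midnight cannot express. The sketch below wraps the same logic in a helper to show this. The function name `label_before_errors`, its parameters and the three-row demo frame are hypothetical and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def label_before_errors(df, hours=4, time_col="time", text_col="text"):
    """Return 1 for non-error rows within `hours` before any error row, else 0."""
    is_error = df[text_col].str.contains("error")
    window = pd.Timedelta(hours=hours)
    # One mask per error timestamp: rows in [error - window, error]
    masks = [df[time_col].between(t - window, t) for t in df.loc[is_error, time_col]]
    # Assumed fallback: no error rows means nothing gets labelled
    in_window = np.logical_or.reduce(masks) if masks else np.zeros(len(df), dtype=bool)
    return (~is_error & in_window).astype(int)


# Made-up demo data: the error happens shortly after midnight
demo = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-26 20:30", "2019-01-26 22:00", "2019-01-27 01:00"]),
    "text": ["fine", "kay", "is error"],
})
demo["error coming"] = label_before_errors(demo)
print(demo)
```

Here the 22:00 row falls inside the window that ends at 01:00 on the next day, while the 20:30 row is just outside it.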
Vectorized solution:
m1 = df['text'].str.contains('error')
v = df.loc[m1, 'time']
print (v)
5 2019-01-26 20:00:00
6 2019-01-26 20:05:00
Name: time, dtype: datetime64[ns]
a = v - pd.Timedelta(4, unit='h')
m = (a.values < df['time'].values[:, None]) & (v.values > df['time'].values[:, None])
df['error coming'] = (m.any(axis=1) & ~m1).astype(int)
print (df)
time text error coming
0 2019-01-26 15:00:00 a-ok 0
1 2019-01-26 16:01:00 fine 1
2 2019-01-26 17:00:00 kay 1
3 2019-01-26 18:00:00 uhum 1
4 2019-01-26 19:00:00 doin well 1
5 2019-01-26 20:00:00 is error 0
6 2019-01-26 20:05:00 still error 0
7 2019-01-26 21:00:00 fine again 0
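The vectorized version relies on NumPy broadcasting: the column of all times is reshaped to (n, 1) and compared against the k error times, giving an (n, k) boolean matrix with one column per error. The following sketch spells out those shapes. The variable names and the explicit `format="%H:%M"` parsing (which fixes the date part to 1900-01-01) are assumptions made to keep the example reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(
        ["15:00", "16:01", "17:00", "18:00", "19:00", "20:00", "20:05", "21:00"],
        format="%H:%M",
    ),
    "text": ["a-ok", "fine", "kay", "uhum", "doin well",
             "is error", "still error", "fine again"],
})

is_error = df["text"].str.contains("error")
ends = df.loc[is_error, "time"].to_numpy()      # shape (k,): error timestamps
starts = ends - np.timedelta64(4, "h")          # shape (k,): window starts
times = df["time"].to_numpy()[:, None]          # shape (n, 1): every row's timestamp

# (n, 1) compared with (k,) broadcasts to (n, k): row i vs. error window j
in_window = (starts < times) & (times < ends)

# A row is labelled if it lies in any window and is not itself an error row
df["error coming"] = (~is_error & in_window.any(axis=1)).astype(int)
print(df)
```

Two things to keep in mind: the intermediate matrix needs n × k booleans, so memory grows with the number of error rows, and the strict `<` comparisons exclude the window endpoints, whereas `between` in the earlier version includes them.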