Resample into equidistant time based on multiple indices
I have a dataframe containing some network flows similar to this:
import pandas as pd

flow = {'date': ['2020-11-13 13:57:51', '2020-11-13 13:57:51', '2020-11-13 13:57:52',
                 '2020-11-13 13:59:53', '2020-11-13 13:59:54'],
        'source_ip': ['192.168.1.1', '192.168.1.2', '10.0.0.1', '192.168.1.1', '192.168.1.1'],
        'destination_ip': ['10.0.0.1', '10.0.0.1', '192.168.1.1', '192.168.1.2', '192.168.1.2'],
        'source_bytes': [5, 1, 2, 3, 3]}
df = pd.DataFrame(flow, columns=['date', 'source_ip', 'destination_ip', 'source_bytes'])
# parse the strings as datetimes and keep 'date' as a column,
# so pd.Grouper(key='date') can bucket by time later
df['date'] = pd.to_datetime(df['date'])
which looks like this:
date | source_ip | destination_ip| source_bytes
2020-11-13 13:57:51 | 192.168.1.1 | 10.0.0.1 | 5
2020-11-13 13:57:51 | 192.168.1.2 | 10.0.0.1 | 1
2020-11-13 13:57:52 | 10.0.0.1 | 192.168.1.1 | 2
2020-11-13 13:59:53 | 192.168.1.1 | 192.168.1.2 | 3
2020-11-13 13:59:54 | 192.168.1.2 | 192.168.1.1 | 3
I want to resample them into 1-minute buckets, but also group by ip. The source_bytes then need to be aggregated regardless of whether the ip appears as source_ip or destination_ip.
It should become something like this (computed manually; hopefully without mistakes). Every ip should be represented in every minute, filled with zero if there is no value.
ip | date | source_bytes_sum
192.168.1.1 | 2020-11-13 13:57:00 | 7
192.168.1.2 | 2020-11-13 13:57:00 | 1
10.0.0.1 | 2020-11-13 13:57:00 | 8
192.168.1.1 | 2020-11-13 13:59:00 | 6
192.168.1.2 | 2020-11-13 13:59:00 | 6
10.0.0.1 | 2020-11-13 13:59:00 | 0
Here is the same representation, just 'grouped' by ip:
ip | date | source_bytes_sum
192.168.1.1 | 2020-11-13 13:57:00 | 7
| 2020-11-13 13:59:00 | 6
192.168.1.2 | 2020-11-13 13:57:00 | 1
| 2020-11-13 13:59:00 | 6
10.0.0.1 | 2020-11-13 13:57:00 | 8
| 2020-11-13 13:59:00 | 0
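The manually computed sums above can be sanity-checked with a few lines of plain Python (a sketch only: it counts each flow's bytes toward both endpoints and truncates timestamps to the minute by string slicing):

```python
from collections import defaultdict

flow = {'date': ['2020-11-13 13:57:51', '2020-11-13 13:57:51', '2020-11-13 13:57:52',
                 '2020-11-13 13:59:53', '2020-11-13 13:59:54'],
        'source_ip': ['192.168.1.1', '192.168.1.2', '10.0.0.1', '192.168.1.1', '192.168.1.1'],
        'destination_ip': ['10.0.0.1', '10.0.0.1', '192.168.1.1', '192.168.1.2', '192.168.1.2'],
        'source_bytes': [5, 1, 2, 3, 3]}

totals = defaultdict(int)
for date, src, dst, b in zip(flow['date'], flow['source_ip'],
                             flow['destination_ip'], flow['source_bytes']):
    minute = date[:17] + '00'  # '2020-11-13 13:57:51' -> '2020-11-13 13:57:00'
    # each flow's bytes count for BOTH endpoints
    totals[(src, minute)] += b
    totals[(dst, minute)] += b

print(totals[('10.0.0.1', '2020-11-13 13:57:00')])   # 8
print(totals[('192.168.1.1', '2020-11-13 13:59:00')])  # 6
```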
I started with the following approach, but it only groups by source_ip and ignores destination_ip. It also does not add the zero values:
grouped = df.groupby(['source_ip', pd.Grouper(key='date', freq='1min')])[['source_bytes']].agg(['sum'])
grouped
source_bytes
sum
source_ip date
10.0.0.1 2020-11-13 13:57:00 2
192.168.1.1 2020-11-13 13:57:00 5
2020-11-13 13:59:00 6
192.168.1.2 2020-11-13 13:57:00 1
First unpivot with DataFrame.melt, then use your solution with Grouper, and for the 0 values add Series.unstack with DataFrame.stack:
df = (df.melt(['date', 'source_bytes'], value_name='ip')
.groupby(['ip', pd.Grouper(key='date', freq='1min')])['source_bytes']
.sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='sum'))
print(df)
ip date sum
0 10.0.0.1 2020-11-13 13:57:00 8
1 10.0.0.1 2020-11-13 13:59:00 0
2 192.168.1.1 2020-11-13 13:57:00 7
3 192.168.1.1 2020-11-13 13:59:00 6
4 192.168.1.2 2020-11-13 13:57:00 1
5 192.168.1.2 2020-11-13 13:59:00 6
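To see why this works: melt duplicates each flow row, once with its source_ip and once with its destination_ip in the ip column, so each flow's bytes are counted toward both endpoints. A minimal sketch of that intermediate step (rebuilding the sample frame, since the pipeline above reassigns df):

```python
import pandas as pd

flow = {'date': ['2020-11-13 13:57:51', '2020-11-13 13:57:51', '2020-11-13 13:57:52',
                 '2020-11-13 13:59:53', '2020-11-13 13:59:54'],
        'source_ip': ['192.168.1.1', '192.168.1.2', '10.0.0.1', '192.168.1.1', '192.168.1.1'],
        'destination_ip': ['10.0.0.1', '10.0.0.1', '192.168.1.1', '192.168.1.2', '192.168.1.2'],
        'source_bytes': [5, 1, 2, 3, 3]}
df = pd.DataFrame(flow)
df['date'] = pd.to_datetime(df['date'])

# each of the 5 flows becomes 2 rows: one per endpoint
melted = df.melt(['date', 'source_bytes'], value_name='ip')
print(melted)
```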
Or use DataFrame.stack with source_bytes appended to the MultiIndex via DataFrame.set_index:
df = (df.set_index(['date','source_bytes'])
.stack()
.reset_index(name='ip')
.groupby(['ip', pd.Grouper(key='date', freq='1min')])['source_bytes']
.sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='sum')
)
print(df)
ip date sum
0 10.0.0.1 2020-11-13 13:57:00 8
1 10.0.0.1 2020-11-13 13:59:00 0
2 192.168.1.1 2020-11-13 13:57:00 7
3 192.168.1.1 2020-11-13 13:59:00 6
4 192.168.1.2 2020-11-13 13:57:00 1
5 192.168.1.2 2020-11-13 13:59:00 6
EDIT: To use more aggregation functions, use:
df = pd.DataFrame(flow, columns = ['date', 'source_ip', 'destination_ip', 'source_bytes'])
df['date'] = pd.to_datetime(df['date'])
df2 = (df.melt(['date', 'source_bytes'], value_name='ip')
.groupby(['ip', pd.Grouper(key='date', freq='1min')])['source_bytes']
.agg(['sum','min','mean'])
.unstack(fill_value=0)
.stack()
.reset_index()
)
print(df2)
ip date sum min mean
0 10.0.0.1 2020-11-13 13:57:00 8 1 2.666667
1 10.0.0.1 2020-11-13 13:59:00 0 0 0.000000
2 192.168.1.1 2020-11-13 13:57:00 7 2 3.500000
3 192.168.1.1 2020-11-13 13:59:00 6 3 3.000000
4 192.168.1.2 2020-11-13 13:57:00 1 1 1.000000
5 192.168.1.2 2020-11-13 13:59:00 6 3 3.000000
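One caveat: unstack/stack only fills zeros for minutes that appear somewhere in the data, so the empty 13:58 bucket never shows up above. If truly equidistant 1-minute ticks are needed, a sketch that reindexes over the full ip × minute grid instead (MultiIndex.from_product and date_range are standard pandas; the variable names are mine):

```python
import pandas as pd

flow = {'date': ['2020-11-13 13:57:51', '2020-11-13 13:57:51', '2020-11-13 13:57:52',
                 '2020-11-13 13:59:53', '2020-11-13 13:59:54'],
        'source_ip': ['192.168.1.1', '192.168.1.2', '10.0.0.1', '192.168.1.1', '192.168.1.1'],
        'destination_ip': ['10.0.0.1', '10.0.0.1', '192.168.1.1', '192.168.1.2', '192.168.1.2'],
        'source_bytes': [5, 1, 2, 3, 3]}
df = pd.DataFrame(flow)
df['date'] = pd.to_datetime(df['date'])

melted = df.melt(['date', 'source_bytes'], value_name='ip')
sums = (melted.groupby(['ip', pd.Grouper(key='date', freq='1min')])['source_bytes']
              .sum())

# build every (ip, minute) combination explicitly, including minutes
# with no traffic at all, and fill the gaps with 0
ips = melted['ip'].unique()
minutes = pd.date_range(df['date'].min().floor('min'),
                        df['date'].max().floor('min'), freq='1min')
full = (sums.reindex(pd.MultiIndex.from_product([ips, minutes],
                                                names=['ip', 'date']),
                     fill_value=0)
            .reset_index(name='sum'))
print(full)  # 3 ips x 3 minutes = 9 rows, 13:58 included with sum 0
```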