Pandas indexing by column pairs (5-tuple)
I am trying to assign a flow ID to network 5-tuples. The original DataFrame looks like this:
import pandas as pd
tup = [['192.168.0.1', '1032', '192.168.0.2', '443'],
['192.168.0.1', '1032', '192.168.0.2', '443'],
['192.168.0.1', '1034', '192.168.0.2', '443'],
['192.168.0.2', '443', '192.168.0.1', '1034'],
['192.168.0.1', '1034', '192.168.0.2', '443'],
['192.168.0.1', '1034', '192.168.0.2', '443'],
['192.168.0.2', '443', '192.168.0.1', '1034'],
['192.168.0.2', '443', '192.168.0.1', '1034'],
['192.168.0.1', '1032', '192.168.0.2', '443'],
['192.168.0.2', '443', '192.168.0.1', '1032']]
df = pd.DataFrame(tup, columns=['src', 'src_port', 'dst', 'dst_port'])
Traffic that belongs to the same flow (inbound/outbound) should get the same flow ID:
src src_port dst dst_port flow_id
0 192.168.0.1 1032 192.168.0.2 443 1
1 192.168.0.1 1032 192.168.0.2 443 1
2 192.168.0.1 1034 192.168.0.2 443 2
3 192.168.0.2 443 192.168.0.1 1034 2
4 192.168.0.1 1034 192.168.0.2 443 2
5 192.168.0.1 1034 192.168.0.2 443 2
6 192.168.0.2 443 192.168.0.1 1034 2
7 192.168.0.2 443 192.168.0.1 1034 2
8 192.168.0.1 1032 192.168.0.2 443 1
9 192.168.0.2 443 192.168.0.1 1032 1
I converted the DataFrame to values and sorted them together, but I am stuck on setting the correct flow index.
Is there a faster or more elegant way?
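One idea is to sort each row's two (address, port) pairs into a nested tuple, so both directions of a flow produce the same key, and then call factorize: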
a = df[['src','src_port','dst','dst_port']].to_numpy()
s = [tuple(sorted(((x[0], x[1]), (x[2], x[3])))) for x in a]
df['flow_id'] = pd.factorize(s)[0] + 1
print(df)
src src_port dst dst_port flow_id
0 192.168.0.1 1032 192.168.0.2 443 1
1 192.168.0.1 1032 192.168.0.2 443 1
2 192.168.0.1 1034 192.168.0.2 443 2
3 192.168.0.2 443 192.168.0.1 1034 2
4 192.168.0.1 1034 192.168.0.2 443 2
5 192.168.0.1 1034 192.168.0.2 443 2
6 192.168.0.2 443 192.168.0.1 1034 2
7 192.168.0.2 443 192.168.0.1 1034 2
8 192.168.0.1 1032 192.168.0.2 443 1
9 192.168.0.2 443 192.168.0.1 1032 1
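If the list comprehension turns out to be slow on a large frame, the same idea can probably be written with column-wise operations. The following is only a sketch under some assumptions: all four columns are strings (as in the sample data), and the names endpoint_a, endpoint_b, swap, lo and hi are made up for illustration. It builds one string per endpoint, puts the lexicographically smaller endpoint first, and factorizes the combined key.

```python
import pandas as pd

tup = [['192.168.0.1', '1032', '192.168.0.2', '443'],
       ['192.168.0.1', '1032', '192.168.0.2', '443'],
       ['192.168.0.1', '1034', '192.168.0.2', '443'],
       ['192.168.0.2', '443', '192.168.0.1', '1034'],
       ['192.168.0.1', '1034', '192.168.0.2', '443'],
       ['192.168.0.1', '1034', '192.168.0.2', '443'],
       ['192.168.0.2', '443', '192.168.0.1', '1034'],
       ['192.168.0.2', '443', '192.168.0.1', '1034'],
       ['192.168.0.1', '1032', '192.168.0.2', '443'],
       ['192.168.0.2', '443', '192.168.0.1', '1032']]
df = pd.DataFrame(tup, columns=['src', 'src_port', 'dst', 'dst_port'])

# One string per endpoint (assumes every column holds strings).
endpoint_a = df['src'] + ':' + df['src_port']
endpoint_b = df['dst'] + ':' + df['dst_port']

# True where the source endpoint sorts after the destination endpoint.
swap = endpoint_a > endpoint_b

# Put the smaller endpoint first so both directions share one key.
lo = endpoint_a.where(~swap, endpoint_b)
hi = endpoint_b.where(~swap, endpoint_a)

# factorize numbers the keys in order of first appearance, starting at 0.
df['flow_id'] = pd.factorize(lo + '|' + hi)[0] + 1
print(df)
```

On the sample data this should give the same flow_id column as the nested-tuple version. The comparison is purely lexicographic, which is enough here because the ordering only has to be consistent, not numerically meaningful.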