pandas clients table >> add edges to a nodes column and deltatimes
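I have a data table from which I want to build a graph (a pasteable data sample is at the end of the post).
To do that, I want to create nodes and edges.
Each client goes through different process states.
An edge connects two states (nodes).
My goal is to get the edges, and the delta time of each change, as shown in the Excel table screenshot.
My code:
First I sort the table by client and timestamp (i.e. the nodes (states) go from t1 to t2 to t3 ... where t1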
estados=estados.sort_values(['CLIENT', 'timestamp'], ascending=[True, True])
What follows is 20% pythonic and 0% pandonic code:
edges_column = []
delta_column = []
for client in list_of_clients:
    client_df = estados.loc[estados['CLIENT'] == client, ['node', 'timestamp']]
    client_nodes = client_df['node']
    client_timestamps = client_df['timestamp']
    list_edges = [node1 + '-' + node2 for node1, node2 in zip(client_nodes[:-1], client_nodes[1:])]
    list_delta_times = [t2 - t1 for t1, t2 in zip(client_timestamps[:-1], client_timestamps[1:])]
    print(list_edges)
    print(list_delta_times)
    # adding ['-'] because if there are n nodes there are n-1 edges. the same for delta times
    edges_column = edges_column + list_edges + ['-']
    delta_column = delta_column + list_delta_times + ['-']
# adding the columns edges_column and delta_column
print(len(edges_column))
estados['edge'] = edges_column
estados['deltatime'] = delta_column
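This code works, but it is far from ideal.
This should be a very common problem. I need more efficient code because I have 500k rows and it should run in a reasonable time.
I am looking for a function to create the edges and timestamps columns.
I could not come up with such a solution because the function has to refer to values in two different rows rather than a single value. With a single value I could do something like
estados['edge'] = estados['node'].apply(function)
but here I have to pass two values instead of one.
Is there a way to do this without a for loop?
Thanks.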
The table as a pandas DataFrame:
Note 1: JSON for copy and paste: {"CLIENT":{"0":"client1","1":"client1","2":"client1","3":"client1","4":"client2","5":"client2","6":"client2","7":"client3","8":"client4","9":"client4","10":"client4","11":"client4","12":"client4","13":"client4"},"node":{"0":"A","1":"B","2":"C","3":"H","4":"B","5":"F","6":"G","7":"C","8":"D","9":"E","10":"F","11":"H","12":"G","13":"K"},"timestamp":{"0":1590684862000,"1":1590771262270,"2":1590857662000,"3":1590598462000,"4":1590425662000,"5":1590512062000,"6":1590598462000,"7":1590771262270,"8":1588352062000,"9":1588524862000,"10":1588611262000,"11":1588697662000,"12":1588956862000,"13":1589043262000}}
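For anyone who wants to reproduce the table, a minimal sketch of loading JSON of this shape might look like the following. The string `raw` is a shortened excerpt (only the four client1 rows) to keep it brief, and converting the epoch-millisecond integers to datetimes is an assumption about how the timestamps are meant to be read.

```python
import json

import pandas as pd

# Shortened excerpt of the sample above: only the client1 rows.
raw = """
{"CLIENT": {"0": "client1", "1": "client1", "2": "client1", "3": "client1"},
 "node": {"0": "A", "1": "B", "2": "C", "3": "H"},
 "timestamp": {"0": 1590684862000, "1": 1590771262270,
               "2": 1590857662000, "3": 1590598462000}}
"""

estados = pd.DataFrame(json.loads(raw))

# JSON keys are strings, so restore an integer index in numeric order.
estados.index = estados.index.astype(int)
estados = estados.sort_index()

# Assumption: the integers are milliseconds since the Unix epoch.
estados["timestamp"] = pd.to_datetime(estados["timestamp"], unit="ms")

print(estados)
```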
You can use df.shift with pd.Series.str.cat here.
df['result'] = df.groupby('CLIENT').node.shift(1).str.cat(df.node,'-')
df
     CLIENT node      timestamp result
0   client1    A  1590684862000    NaN
1   client1    B  1590771262270    A-B
2   client1    C  1590857662000    B-C
3   client1    H  1590598462000    C-H
4   client2    B  1590425662000    NaN
5   client2    F  1590512062000    B-F
6   client2    G  1590598462000    F-G
7   client3    C  1590771262270    NaN
8   client4    D  1588352062000    NaN
9   client4    E  1588524862000    D-E
10  client4    F  1588611262000    E-F
11  client4    H  1588697662000    F-H
12  client4    G  1588956862000    H-G
13  client4    K  1589043262000    G-K
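The question also asks for the delta time of each change, and its loop puts each edge on the row of the source node after sorting by client and timestamp. A possible sketch of that variant, using the same grouped shift idea, is below. The column names `edge` and `deltatime` come from the question. Treating the timestamps as epoch milliseconds and leaving the last row of each client as NaN/NaT (rather than '-') are assumptions.

```python
import pandas as pd

# Sample data from the question, typed out as plain lists.
df = pd.DataFrame({
    "CLIENT": ["client1"] * 4 + ["client2"] * 3 + ["client3"] + ["client4"] * 6,
    "node": list("ABCHBFGCDEFHGK"),
    "timestamp": [
        1590684862000, 1590771262270, 1590857662000, 1590598462000,
        1590425662000, 1590512062000, 1590598462000,
        1590771262270,
        1588352062000, 1588524862000, 1588611262000,
        1588697662000, 1588956862000, 1589043262000,
    ],
})

# Assumption: the integers are milliseconds since the Unix epoch.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

# Sort as in the question so consecutive rows are consecutive states.
df = df.sort_values(["CLIENT", "timestamp"]).reset_index(drop=True)

by_client = df.groupby("CLIENT")

# Next state and next timestamp within the same client (NaN/NaT on the last row).
next_node = by_client["node"].shift(-1)
next_time = by_client["timestamp"].shift(-1)

df["edge"] = df["node"].str.cat(next_node, sep="-")
df["deltatime"] = next_time - df["timestamp"]

print(df)
```

If the '-' placeholder from the original loop is wanted, `df["edge"].fillna("-")` should give it for the edge column. Keeping `deltatime` as NaT on those rows preserves the timedelta dtype, which is usually more convenient for later arithmetic.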