pandas clients table >> add edges to a nodes column and deltatimes

Question

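I have a data table from which I want to create a graph (a sample of the data to paste is at the end of this post). To do that, I want to create nodes and edges. Each client goes through different process states. An edge connects two states (nodes). My goal is to obtain the edges, and the delta time for each change, as shown in the Excel table screenshot.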

My code: first, I sort the table by client and timestamp (i.e., the nodes (states) go from t1 to t2 to t3 ..., where t1 < t2 < t3 ...):

estados=estados.sort_values(['CLIENT', 'timestamp'], ascending=[True, True])

Now follows code that is 20% pythonic and 0% pandonic:

edges_column = []
delta_column = []
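# assumption: list_of_clients is not defined in the post; unique() keeps the sorted client order
list_of_clients = estados['CLIENT'].unique()
for client in list_of_clients: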
    client_df = estados.loc[estados['CLIENT'] == client,['node','timestamp']]
    client_nodes      = client_df['node']
    client_timestamps = client_df['timestamp']
    list_edges        = [node1 + '-' + node2 for node1,node2 in  zip(client_nodes[:-1],client_nodes[1:])]
    list_delta_times  = [t2 -t1 for t1,t2 in  zip(client_timestamps[:-1],client_timestamps[1:])]
    print(list_edges)
    print(list_delta_times)
    # adding ['-'] because if there are n nodes there are n-1 edges. the same for delta times
    edges_column = edges_column + list_edges + ['-']
    delta_column = delta_column + list_delta_times + ['-']

# adding the columns edges_column and delta_column
print(len(edges_column))
estados['edge']      = edges_column
estados['deltatime'] = delta_column

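This code works, but it is far from ideal. This should be a very common problem. I need more efficient code, because I have 500k rows and it should run in a reasonable time.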

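I am looking for a function to create the edge and timestamp columns. I could not come up with such a solution because the function refers to values in two different rows, not just to one value. With a single value I could do something like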

estados['edge'] = estados['node'].apply(function)

but here I have to pass two values instead of one.

Is there a way to do this without for loops?

Thanks.

The table in pandas format:

Note 1: JSON to copy and paste: {"CLIENT":{"0":"client1","1":"client1","2":"client1","3":"client1","4":"client2","5":"client2","6":"client2","7":"client3","8":"client4","9":"client4","10":"client4","11":"client4","12":"client4","13":"client4"},"node":{"0":"A","1":"B","2":"C","3":"H","4":"B","5":"F","6":"G","7":"C","8":"D","9":"E","10":"F","11":"H","12":"G","13":"K"},"timestamp":{"0":1590684862000,"1":1590771262270,"2":1590857662000,"3":1590598462000,"4":1590425662000,"5":1590512062000,"6":1590598462000,"7":1590771262270,"8":1588352062000,"9":1588524862000,"10":1588611262000,"11":1588697662000,"12":1588956862000,"13":1589043262000}}
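For anyone who wants to reproduce this, here is a minimal sketch of loading the JSON into a DataFrame. It uses a shortened copy of the data (client1 and client2 only), and it assumes the timestamps are epoch milliseconds, which the post does not state explicitly. The name raw is just a placeholder.

```python
import json
import pandas as pd

# Shortened copy of the JSON above (first two clients only).
raw = (
    '{"CLIENT":{"0":"client1","1":"client1","2":"client1","3":"client1",'
    '"4":"client2","5":"client2","6":"client2"},'
    '"node":{"0":"A","1":"B","2":"C","3":"H","4":"B","5":"F","6":"G"},'
    '"timestamp":{"0":1590684862000,"1":1590771262270,"2":1590857662000,'
    '"3":1590598462000,"4":1590425662000,"5":1590512062000,"6":1590598462000}}'
)

estados = pd.DataFrame(json.loads(raw))
# JSON object keys are strings; turn the row labels back into integers.
estados.index = estados.index.astype(int)
# Assumption: the integers are milliseconds since the Unix epoch.
estados['timestamp'] = pd.to_datetime(estados['timestamp'], unit='ms')
# Same ordering as in the question: by client, then by time.
estados = estados.sort_values(['CLIENT', 'timestamp'])
print(estados)
```

Converting the column to real datetimes means that subtracting two timestamps later yields a Timedelta rather than a bare integer.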

Answer 1

You can use df.shift with pd.Series.str.cat here.

df['result'] = df.groupby('CLIENT').node.shift(1).str.cat(df.node,'-')
df

     CLIENT node      timestamp result
0   client1    A  1590684862000    NaN
1   client1    B  1590771262270    A-B
2   client1    C  1590857662000    B-C
3   client1    H  1590598462000    C-H
4   client2    B  1590425662000    NaN
5   client2    F  1590512062000    B-F
6   client2    G  1590598462000    F-G
7   client3    C  1590771262270    NaN
8   client4    D  1588352062000    NaN
9   client4    E  1588524862000    D-E
10  client4    F  1588611262000    E-F
11  client4    H  1588697662000    F-H
12  client4    G  1588956862000    H-G
13  client4    K  1589043262000    G-K
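The output above covers only the edges, and on the unsorted table. As a rough sketch of how the same grouped-shift idea might be extended to the delta times, the following sorts first (as the question does) and puts each edge on the row where it starts, which is the alignment the question's loop produces. It uses only the client1 and client2 rows and assumes the timestamps are epoch milliseconds. The name next_rows is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'CLIENT': ['client1'] * 4 + ['client2'] * 3,
    'node': ['A', 'B', 'C', 'H', 'B', 'F', 'G'],
    'timestamp': [1590684862000, 1590771262270, 1590857662000, 1590598462000,
                  1590425662000, 1590512062000, 1590598462000],
})
# Assumption: the integers are milliseconds since the Unix epoch.
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
df = df.sort_values(['CLIENT', 'timestamp']).reset_index(drop=True)

# For every row, fetch the node and timestamp of the same client's next row.
next_rows = df.groupby('CLIENT')[['node', 'timestamp']].shift(-1)

# Edge "current-next"; NaN on each client's last row, which has no next state.
df['edge'] = df['node'].str.cat(next_rows['node'], sep='-')
# Time until the next state; NaT on each client's last row.
df['deltatime'] = next_rows['timestamp'] - df['timestamp']
print(df)
```

If the '-' placeholder from the question is wanted on each client's last row, the missing edges can be filled afterwards with fillna('-'). Leaving deltatime as NaT keeps that column a proper timedelta type.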


python

timedelta

pandas