Defining friendships in a Data Frame

Question

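My task is to determine whether friendly connections exist between employees. Let me explain: at work there is a checkpoint. Every employee who passes through it is logged in a database with their name and the time of passing. If an employee often passes the checkpoint together with the same person, we can assume with some probability that the two are friends. The time difference between their passes also matters: if it is large, they most likely never even saw each other. As an example, I made a small time series: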

    import pandas as pd

    dict_df = {
        'Data': ['2020-02-10 10:00:23', '2020-02-10 10:01:23', '2020-02-10 10:01:30', '2020-02-10 10:01:43',
                 '2020-02-10 10:02:02', '2020-02-10 10:02:30', '2020-02-10 10:02:35', '2020-02-10 10:02:50',
                 '2020-02-10 10:02:58', '2020-02-10 10:03:02', '2020-02-10 10:03:10', '2020-02-10 10:03:15',
                 '2020-02-10 10:03:26', '2020-02-10 10:03:32', '2020-02-10 10:03:38', '2020-02-10 10:03:40',
                 '2020-02-10 10:03:46', '2020-02-10 10:03:50', '2020-02-10 10:04:04', '2020-02-10 10:04:12',
                 '2020-02-10 10:04:23', '2020-02-10 10:04:27', '2020-02-10 10:04:45', '2020-02-10 10:04:50',
                 '2020-02-10 10:04:59', '2020-02-10 10:05:20', '2020-02-10 10:05:26', '2020-02-10 10:05:40',
                 '2020-02-10 10:05:56', '2020-02-10 10:06:12', '2020-02-10 10:06:18', '2020-02-10 10:06:30',
                 '2020-02-10 10:06:37'],
        'Name': ['Ann', 'Jhon', 'Chase', 'Bruce', 'Evan', 'Fred', 'Hugh', 'Gregory', 'Jack', 'Caleb', 'Eric', 'James',
                 'Ann', 'Gerld', 'Jess', 'Juan', 'Luke', 'Kyle', 'Neil', 'Owen', 'James', 'Eric', 'Jhon', 'Jess', 'Norman',
                 'Hugh', 'Fred', 'Gregory', 'Ryan', 'Angel', 'Cole', 'James', 'Eric']}

    df = pd.DataFrame(dict_df)

This is what it looks like:

 Data                Name
0   2020-02-10 10:00:23 Ann
1   2020-02-10 10:01:23 Jhon
2   2020-02-10 10:01:30 Chase
3   2020-02-10 10:01:43 Bruce
4   2020-02-10 10:02:02 Evan
5   2020-02-10 10:02:30 Fred
6   2020-02-10 10:02:35 Hugh
7   2020-02-10 10:02:50 Gregory
8   2020-02-10 10:02:58 Jack
9   2020-02-10 10:03:02 Caleb
10  2020-02-10 10:03:10 Eric
11  2020-02-10 10:03:15 James
12  2020-02-10 10:03:26 Ann
13  2020-02-10 10:03:32 Gerld
14  2020-02-10 10:03:38 Jess
15  2020-02-10 10:03:40 Juan
16  2020-02-10 10:03:46 Luke
17  2020-02-10 10:03:50 Kyle
18  2020-02-10 10:04:04 Neil
19  2020-02-10 10:04:12 Owen
20  2020-02-10 10:04:23 James
21  2020-02-10 10:04:27 Eric
22  2020-02-10 10:04:45 Jhon
23  2020-02-10 10:04:50 Jess
24  2020-02-10 10:04:59 Norman
25  2020-02-10 10:05:20 Hugh
26  2020-02-10 10:05:26 Fred
27  2020-02-10 10:05:40 Gregory
28  2020-02-10 10:05:56 Ryan
29  2020-02-10 10:06:12 Angel
30  2020-02-10 10:06:18 Cole
31  2020-02-10 10:06:30 James
32  2020-02-10 10:06:37 Eric

This is what I need:

 Data                Name   cluster
0   2020-02-10 10:00:23 Ann     0
1   2020-02-10 10:01:23 Jhon    0
2   2020-02-10 10:01:30 Chase   0
3   2020-02-10 10:01:43 Bruce   0
4   2020-02-10 10:02:02 Evan    0
5   2020-02-10 10:02:30 Fred    1
6   2020-02-10 10:02:35 Hugh    1
7   2020-02-10 10:02:50 Gregory 1
8   2020-02-10 10:02:58 Jack    0
9   2020-02-10 10:03:02 Caleb   0
10  2020-02-10 10:03:10 Eric    2
11  2020-02-10 10:03:15 James   2
12  2020-02-10 10:03:26 Ann     0
13  2020-02-10 10:03:32 Gerld   0
14  2020-02-10 10:03:38 Jess    0
15  2020-02-10 10:03:40 Juan    0
16  2020-02-10 10:03:46 Luke    0
17  2020-02-10 10:03:50 Kyle    0
18  2020-02-10 10:04:04 Neil    0
19  2020-02-10 10:04:12 Owen    0
20  2020-02-10 10:04:23 James   2
21  2020-02-10 10:04:27 Eric    2
22  2020-02-10 10:04:45 Jhon    0
23  2020-02-10 10:04:50 Jess    0
24  2020-02-10 10:04:59 Norman  0
25  2020-02-10 10:05:20 Hugh    1
26  2020-02-10 10:05:26 Fred    1
27  2020-02-10 10:05:40 Gregory 1
28  2020-02-10 10:05:56 Ryan    0
29  2020-02-10 10:06:12 Angel   0
30  2020-02-10 10:06:18 Cole    0
31  2020-02-10 10:06:30 James   2
32  2020-02-10 10:06:37 Eric    2

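As you can see, Fred, Gregory and Hugh passed through together more than once, so a friendly connection is established between them. James and Eric also passed through together repeatedly, so they are friends as well.

Please help me solve this with machine learning, for example clustering or graph analysis. Maybe someone has an idea.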

Answer 1

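No clustering algorithm is needed. Such algorithms are useful when your data has multiple features. Here there is only one: arrival time. Simply keep track of how often each pair arrives "together".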

loop over arrivals
   loop over previous arrivals, recent enough to be friends
       increment count for this pair
loop over pairs
   if count above minimum, mark as friends
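The loops above could be sketched in Python roughly as follows. This is not the answerer's code (that is in C++); the names `count_pairs`, `max_gap_seconds` and `min_count` are made up for illustration, and storing each pair in alphabetical order is an assumption about how to treat (A, B) and (B, A) as the same pair. The gap test is inclusive, so two arrivals exactly 20 seconds apart count as "together".

```python
from collections import Counter
from datetime import datetime


def count_pairs(arrivals, max_gap_seconds=20):
    """Count how often each pair of names arrives within max_gap_seconds.

    arrivals: iterable of (timestamp, name) tuples.
    Returns a Counter keyed by alphabetically sorted (name, name) tuples.
    """
    arrivals = sorted(arrivals)  # make sure we walk in time order
    counts = Counter()
    for i, (t, name) in enumerate(arrivals):
        # loop over previous arrivals, newest first, while recent enough
        for j in range(i - 1, -1, -1):
            t_prev, prev = arrivals[j]
            if (t - t_prev).total_seconds() > max_gap_seconds:
                break  # everything earlier is even further away
            if prev != name:  # ignore a person "meeting" themselves
                counts[tuple(sorted((prev, name)))] += 1
    return counts


# Sample data from the question (all on 2020-02-10).
raw = [
    ("10:00:23", "Ann"), ("10:01:23", "Jhon"), ("10:01:30", "Chase"),
    ("10:01:43", "Bruce"), ("10:02:02", "Evan"), ("10:02:30", "Fred"),
    ("10:02:35", "Hugh"), ("10:02:50", "Gregory"), ("10:02:58", "Jack"),
    ("10:03:02", "Caleb"), ("10:03:10", "Eric"), ("10:03:15", "James"),
    ("10:03:26", "Ann"), ("10:03:32", "Gerld"), ("10:03:38", "Jess"),
    ("10:03:40", "Juan"), ("10:03:46", "Luke"), ("10:03:50", "Kyle"),
    ("10:04:04", "Neil"), ("10:04:12", "Owen"), ("10:04:23", "James"),
    ("10:04:27", "Eric"), ("10:04:45", "Jhon"), ("10:04:50", "Jess"),
    ("10:04:59", "Norman"), ("10:05:20", "Hugh"), ("10:05:26", "Fred"),
    ("10:05:40", "Gregory"), ("10:05:56", "Ryan"), ("10:06:12", "Angel"),
    ("10:06:18", "Cole"), ("10:06:30", "James"), ("10:06:37", "Eric"),
]
arrivals = [
    (datetime.strptime("2020-02-10 " + t, "%Y-%m-%d %H:%M:%S"), name)
    for t, name in raw
]

min_count = 2  # hypothetical name for the minimum frequency
counts = count_pairs(arrivals, max_gap_seconds=20)
friends = {pair: n for pair, n in counts.items() if n >= min_count}

for (a, b), n in sorted(friends.items()):
    print(a, b, n)
# Eric James 3
# Fred Gregory 2
# Fred Hugh 2
# Gregory Hugh 2
```

With a DataFrame like the one in the question, the same function should accept `zip(pd.to_datetime(df['Data']), df['Name'])` as its input, since pandas timestamps subtract to a `Timedelta` that also has `total_seconds()`.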

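With the maximum time between friends' arrivals set to 20 seconds, and the minimum frequency for a pair to be identified as friends set to 2, we get: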

together count:
Angel Cole 1
Angel James 1
Ann Gerld 1
Ann Jess 1
Ann Juan 1
Ann Luke 1
Bruce Evan 1
Caleb Eric 1
Caleb James 1
Chase Bruce 1
Cole Eric 1
Cole James 1
Eric Ann 1
Eric James 3
Eric Jhon 1
Fred Gregory 2
Fred Hugh 2
Gerld Jess 1
Gerld Juan 1
Gerld Kyle 1
Gerld Luke 1
Gregory Caleb 1
Gregory Eric 1
Gregory Jack 1
Gregory Ryan 1
Hugh Gregory 2
Jack Caleb 1
Jack Eric 1
Jack James 1
James Ann 1
James Gerld 1
Jess Juan 1
Jess Kyle 1
Jess Luke 1
Jess Norman 1
Jhon Bruce 1
Jhon Chase 1
Jhon Jess 1
Jhon Norman 1
Juan Kyle 1
Juan Luke 1
Kyle Neil 1
Luke Kyle 1
Luke Neil 1
Neil James 1
Neil Owen 1
Owen Eric 1
Owen James 1
Ryan Angel 1

So the friends are

friends:
Eric James 3
Fred Gregory 2
Fred Hugh 2
Hugh Gregory 2
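To get from these pairs to the `cluster` column the question asks for, one option is to treat the pairs as edges of a graph and give every connected group its own number. The sketch below does that with a small union-find. The function name `label_clusters` is hypothetical, and numbering groups in order of first appearance (with 0 for people who have no friends) is an assumption chosen to match the desired output in the question. Note that this treats friendship as transitive: if A is friends with B and B with C, all three land in one cluster even if A and C never arrived together.

```python
def label_clusters(names, friend_pairs):
    """Return one cluster id per entry in names; 0 means no known friends."""
    parent = {}

    def find(x):
        # Find the representative of x's group, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every friend pair.
    for a, b in friend_pairs:
        parent[find(a)] = find(b)

    cluster_of_root = {}
    labels = []
    for name in names:
        if name not in parent:
            labels.append(0)  # never part of any friend pair
            continue
        root = find(name)
        if root not in cluster_of_root:
            # Number groups 1, 2, ... in order of first appearance.
            cluster_of_root[root] = len(cluster_of_root) + 1
        labels.append(cluster_of_root[root])
    return labels


# The four friend pairs found above.
friend_pairs = [("Eric", "James"), ("Fred", "Gregory"),
                ("Fred", "Hugh"), ("Hugh", "Gregory")]

# Rows 4 to 12 of the sample data, as a short excerpt.
names = ["Evan", "Fred", "Hugh", "Gregory", "Jack",
         "Caleb", "Eric", "James", "Ann"]

labels = label_clusters(names, friend_pairs)
print(labels)  # [0, 1, 1, 1, 0, 0, 2, 2, 0]
```

On the question's DataFrame this could be applied as `df['cluster'] = label_clusters(df['Name'], friend_pairs)`.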

You can see C++ code that implements this at https://gist.github.com/JamesBremner/cba0a5e8bbda9388c3e983c3bc5dfd9b

python

graph

cluster-analysis

pandas