Defining friendships in a Data Frame
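I have a task to determine whether friendly connections exist. Let me explain: there is a checkpoint at work. Each employee who passes through it is logged in a database with their name and the time they passed. If an employee often passes the checkpoint together with the same person, we can assume with some probability that the two are friends. The time gap between them also matters: if it is large, they most likely never even saw each other. For example, I made a small time series: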
import pandas as pd
dict_df={
'Data':['2020-02-10 10:00:23','2020-02-10 10:01:23','2020-02-10 10:01:30','2020-02-10 10:01:43',
'2020-02-10 10:02:02','2020-02-10 10:02:30','2020-02-10 10:02:35','2020-02-10 10:02:50',
'2020-02-10 10:02:58','2020-02-10 10:03:02','2020-02-10 10:03:10','2020-02-10 10:03:15',
'2020-02-10 10:03:26','2020-02-10 10:03:32','2020-02-10 10:03:38','2020-02-10 10:03:40',
'2020-02-10 10:03:46','2020-02-10 10:03:50','2020-02-10 10:04:04','2020-02-10 10:04:12',
'2020-02-10 10:04:23','2020-02-10 10:04:27','2020-02-10 10:04:45','2020-02-10 10:04:50',
'2020-02-10 10:04:59','2020-02-10 10:05:20','2020-02-10 10:05:26','2020-02-10 10:05:40',
'2020-02-10 10:05:56','2020-02-10 10:06:12','2020-02-10 10:06:18','2020-02-10 10:06:30',
'2020-02-10 10:06:37'],
'Name':['Ann','Jhon','Chase','Bruce','Evan','Fred','Hugh','Gregory','Jack','Caleb','Eric','James',
'Ann','Gerld','Jess','Juan','Luke','Kyle','Neil','Owen','James','Eric','Jhon','Jess','Norman',
'Hugh','Fred','Gregory','Ryan','Angel','Cole','James','Eric']}
df=pd.DataFrame(dict_df)
Here is what it looks like:
Data Name
0 2020-02-10 10:00:23 Ann
1 2020-02-10 10:01:23 Jhon
2 2020-02-10 10:01:30 Chase
3 2020-02-10 10:01:43 Bruce
4 2020-02-10 10:02:02 Evan
5 2020-02-10 10:02:30 Fred
6 2020-02-10 10:02:35 Hugh
7 2020-02-10 10:02:50 Gregory
8 2020-02-10 10:02:58 Jack
9 2020-02-10 10:03:02 Caleb
10 2020-02-10 10:03:10 Eric
11 2020-02-10 10:03:15 James
12 2020-02-10 10:03:26 Ann
13 2020-02-10 10:03:32 Gerld
14 2020-02-10 10:03:38 Jess
15 2020-02-10 10:03:40 Juan
16 2020-02-10 10:03:46 Luke
17 2020-02-10 10:03:50 Kyle
18 2020-02-10 10:04:04 Neil
19 2020-02-10 10:04:12 Owen
20 2020-02-10 10:04:23 James
21 2020-02-10 10:04:27 Eric
22 2020-02-10 10:04:45 Jhon
23 2020-02-10 10:04:50 Jess
24 2020-02-10 10:04:59 Norman
25 2020-02-10 10:05:20 Hugh
26 2020-02-10 10:05:26 Fred
27 2020-02-10 10:05:40 Gregory
28 2020-02-10 10:05:56 Ryan
29 2020-02-10 10:06:12 Angel
30 2020-02-10 10:06:18 Cole
31 2020-02-10 10:06:30 James
32 2020-02-10 10:06:37 Eric
This is what I need:
Data Name cluster
0 2020-02-10 10:00:23 Ann 0
1 2020-02-10 10:01:23 Jhon 0
2 2020-02-10 10:01:30 Chase 0
3 2020-02-10 10:01:43 Bruce 0
4 2020-02-10 10:02:02 Evan 0
5 2020-02-10 10:02:30 Fred 1
6 2020-02-10 10:02:35 Hugh 1
7 2020-02-10 10:02:50 Gregory 1
8 2020-02-10 10:02:58 Jack 0
9 2020-02-10 10:03:02 Caleb 0
10 2020-02-10 10:03:10 Eric 2
11 2020-02-10 10:03:15 James 2
12 2020-02-10 10:03:26 Ann 0
13 2020-02-10 10:03:32 Gerld 0
14 2020-02-10 10:03:38 Jess 0
15 2020-02-10 10:03:40 Juan 0
16 2020-02-10 10:03:46 Luke 0
17 2020-02-10 10:03:50 Kyle 0
18 2020-02-10 10:04:04 Neil 0
19 2020-02-10 10:04:12 Owen 0
20 2020-02-10 10:04:23 James 2
21 2020-02-10 10:04:27 Eric 2
22 2020-02-10 10:04:45 Jhon 0
23 2020-02-10 10:04:50 Jess 0
24 2020-02-10 10:04:59 Norman 0
25 2020-02-10 10:05:20 Hugh 1
26 2020-02-10 10:05:26 Fred 1
27 2020-02-10 10:05:40 Gregory 1
28 2020-02-10 10:05:56 Ryan 0
29 2020-02-10 10:06:12 Angel 0
30 2020-02-10 10:06:18 Cole 0
31 2020-02-10 10:06:30 James 2
32 2020-02-10 10:06:37 Eric 2
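You can see that Fred, Gregory and Hugh passed through together several times, so they have a friendly connection. James and Eric also passed through together repeatedly, so they are friends too.
Please help me solve this with machine learning, for example clustering or graph analysis. Maybe someone has an idea.
You don't need a clustering algorithm. Such algorithms are useful when your data has several features. Here there is only one: the arrival time. Just track how often each pair arrives "together".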
loop over arrivals
    loop over previous arrivals, recent enough to be friends
        increment count for this pair
loop over pairs
    if count is at least the minimum, mark as friends
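The loops above could be sketched in Python roughly as follows. This is only an illustration, not the linked C++ implementation: the function names `count_co_arrivals` and `find_friends`, the treatment of each pair as unordered, the inclusive time threshold, and the skipping of a person's own repeat passes are all assumptions.

```python
from collections import Counter
from datetime import datetime


def count_co_arrivals(times, names, max_gap_seconds=20):
    """Count, per unordered pair of names, how often they arrive close together."""
    # Sort by time so that "previous arrivals" really are the earlier ones.
    events = sorted(zip(times, names), key=lambda event: event[0])
    counts = Counter()
    for i, (time_i, name_i) in enumerate(events):
        # Walk backwards over earlier arrivals until the gap is too large.
        j = i - 1
        while j >= 0:
            time_j, name_j = events[j]
            if (time_i - time_j).total_seconds() > max_gap_seconds:
                break
            if name_i != name_j:  # assumption: ignore a person's own repeat pass
                counts[frozenset((name_i, name_j))] += 1
            j -= 1
    return counts


def find_friends(counts, min_count=2):
    """Keep only the pairs seen together at least min_count times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}


# A few rows taken from the question, parsed with the standard library.
rows = [
    ("2020-02-10 10:02:30", "Fred"),
    ("2020-02-10 10:02:35", "Hugh"),
    ("2020-02-10 10:02:50", "Gregory"),
    ("2020-02-10 10:05:20", "Hugh"),
    ("2020-02-10 10:05:26", "Fred"),
    ("2020-02-10 10:05:40", "Gregory"),
]
times = [datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S") for stamp, _ in rows]
names = [name for _, name in rows]
for pair, n in find_friends(count_co_arrivals(times, names)).items():
    print(sorted(pair), n)
```

With the DataFrame from the question, the inputs would presumably be `pd.to_datetime(df['Data'])` and `df['Name']`.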
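With the maximum gap between friends' arrivals set to 20 seconds and the minimum count for a pair to be recognised as friends set to 2, we get: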
together count:
Angel Cole 1
Angel James 1
Ann Gerld 1
Ann Jess 1
Ann Juan 1
Ann Luke 1
Bruce Evan 1
Caleb Eric 1
Caleb James 1
Chase Bruce 1
Cole Eric 1
Cole James 1
Eric Ann 1
Eric James 3
Eric Jhon 1
Fred Gregory 2
Fred Hugh 2
Gerld Jess 1
Gerld Juan 1
Gerld Kyle 1
Gerld Luke 1
Gregory Caleb 1
Gregory Eric 1
Gregory Jack 1
Gregory Ryan 1
Hugh Gregory 2
Jack Caleb 1
Jack Eric 1
Jack James 1
James Ann 1
James Gerld 1
Jess Juan 1
Jess Kyle 1
Jess Luke 1
Jess Norman 1
Jhon Bruce 1
Jhon Chase 1
Jhon Jess 1
Jhon Norman 1
Juan Kyle 1
Juan Luke 1
Kyle Neil 1
Luke Kyle 1
Luke Neil 1
Neil James 1
Neil Owen 1
Owen Eric 1
Owen James 1
Ryan Angel 1
So the friends are
friends:
Eric James 3
Fred Gregory 2
Fred Hugh 2
Hugh Gregory 2
You can see C++ code implementing this at https://gist.github.com/JamesBremner/cba0a5e8bbda9388c3e983c3bc5dfd9b
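To get from the friend pairs to the `cluster` column the question asks for, one option is to treat the pairs as edges of a graph and give every connected group its own number. The sketch below is a Python illustration under that assumption. The helper name `label_clusters` is hypothetical. Numbering groups by first appearance, with 0 for "no group", follows the desired output in the question.

```python
def label_clusters(names, friend_pairs):
    """Return one cluster id per entry in names; 0 means no friend group."""
    # Build an adjacency map from the friend pairs (each pair holds two names).
    neighbours = {}
    for a, b in friend_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    cluster_of = {}
    next_id = 1
    for name in names:  # the first appearance decides the numbering
        if name in cluster_of or name not in neighbours:
            continue
        # Depth-first walk over everyone reachable through friend links.
        stack = [name]
        while stack:
            current = stack.pop()
            if current in cluster_of:
                continue
            cluster_of[current] = next_id
            stack.extend(n for n in neighbours[current] if n not in cluster_of)
        next_id += 1
    return [cluster_of.get(name, 0) for name in names]


# Friend pairs as listed in the answer above.
friend_pairs = [
    ("Eric", "James"),
    ("Fred", "Gregory"),
    ("Fred", "Hugh"),
    ("Hugh", "Gregory"),
]
# The first twelve names from the question's data.
names = ["Ann", "Jhon", "Chase", "Bruce", "Evan", "Fred",
         "Hugh", "Gregory", "Jack", "Caleb", "Eric", "James"]
print(label_clusters(names, friend_pairs))
# → [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 2]
```

Applied to the question's DataFrame, this would presumably be `df['cluster'] = label_clusters(df['Name'], friend_pairs)`.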