What can be done to get consecutive counts here in the most efficient way?
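I am a beginner in data science with Python. I am working on clickstream data and trying to count the consecutive clicks on an item within a given session. I compute a cumulative sum in the 'Block' column. After that, I aggregate on Block to get the count for each block. Finally, I want to group by Session and Item and aggregate the block counts, because there can be cases (here Sid = 6) where an item first appears m times consecutively and then, after other items, appears again n times consecutively. So the consecutive count should be 'm+n'.
Here is the dataset: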
Sid Tstamp Itemid
0 1 2014-04-07T10:51:09.277Z 214536502
1 1 2014-04-07T10:54:09.868Z 214536500
2 1 2014-04-07T10:54:46.998Z 214536506
3 1 2014-04-07T10:57:00.306Z 214577561
4 2 2014-04-07T13:56:37.614Z 214662742
5 2 2014-04-07T13:57:19.373Z 214662742
6 2 2014-04-07T13:58:37.446Z 214825110
7 2 2014-04-07T13:59:50.710Z 214757390
8 2 2014-04-07T14:00:38.247Z 214757407
9 2 2014-04-07T14:02:36.889Z 214551617
10 3 2014-04-02T13:17:46.940Z 214716935
11 3 2014-04-02T13:26:02.515Z 214774687
12 3 2014-04-02T13:30:12.318Z 214832672
13 4 2014-04-07T12:09:10.948Z 214836765
14 4 2014-04-07T12:26:25.416Z 214706482
15 6 2014-04-03T10:44:35.672Z 214821275
16 6 2014-04-03T10:45:01.674Z 214821275
17 6 2014-04-03T10:45:29.873Z 214821371
18 6 2014-04-03T10:46:12.162Z 214821371
19 6 2014-04-03T10:46:57.355Z 214821371
20 6 2014-04-03T10:53:22.572Z 214717089
21 6 2014-04-03T10:53:49.875Z 214563337
22 6 2014-04-03T10:55:19.267Z 214706462
23 6 2014-04-03T10:55:47.327Z 214821371
24 6 2014-04-03T10:56:30.520Z 214821371
25 6 2014-04-03T10:57:19.331Z 214821371
26 6 2014-04-03T10:57:39.433Z 214819762
Here is my code:
k['Block'] =( k['Itemid'] != k['Itemid'].shift(1) ).astype(int).cumsum()
y=k.groupby('Block').count()
z=k.groupby(['Sid','Itemid']).agg({"y[Count]": lambda x: x.sum()})
Doesn't this work?
k.groupby(['Sid', 'Itemid']).Block.count()
Sid Itemid
1 214536500 1
214536502 1
214536506 1
214577561 1
2 214551617 1
214662742 2
214757390 1
214757407 1
214825110 1
3 214716935 1
214774687 1
214832672 1
4 214706482 1
214836765 1
6 214563337 1
214706462 1
214717089 1
214819762 1
214821275 2
214821371 6
Name: Block, dtype: int64
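A minimal sketch of why a plain count can be enough, assuming the goal is the 'm+n' total described in the question: the lengths of an item's runs within a session always add up to the number of rows for that item in that session. The small frame below is a hand-typed subset (the Sid 6 rows from the question, timestamps omitted), not the full dataset.

```python
import pandas as pd

# Assumption: only the Sid 6 rows from the question, without the Tstamp column.
k = pd.DataFrame({
    "Sid": [6] * 12,
    "Itemid": [214821275, 214821275,
               214821371, 214821371, 214821371,
               214717089, 214563337, 214706462,
               214821371, 214821371, 214821371,
               214819762],
})

# Rows per (session, item). Item 214821371 occurs in two runs of 3,
# so its count is the sum of both run lengths.
counts = k.groupby(["Sid", "Itemid"]).size()
print(counts)
```

If only runs above some minimum length should contribute, a plain count is no longer sufficient and the per-block lengths are needed.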
IIUC, you can:
k['Block'] = (k['Itemid'] != k['Itemid'].shift(1)).astype(int).cumsum()
# print(k)
z = k.groupby(['Sid', 'Itemid', 'Block']).size().groupby(level=[0, 1]).sum().reset_index(name='sum_counts')
print(z)
Sid Itemid sum_counts
0 1 214536500 1
1 1 214536502 1
2 1 214536506 1
3 1 214577561 1
4 2 214551617 1
5 2 214662742 2
6 2 214757390 1
7 2 214757407 1
8 2 214825110 1
9 3 214716935 1
10 3 214774687 1
11 3 214832672 1
12 4 214706482 1
13 4 214836765 1
14 6 214701242 1
15 6 214826623 1
16 7 214826715 1
17 7 214826835 1
18 8 214838855 2
19 9 214576500 3
20 11 214563337 1
21 11 214706462 1
22 11 214717089 1
23 11 214819762 1
24 11 214821275 2
25 11 214821371 6
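The same idea can be sketched as a self-contained script. The data below is a small hand-typed sample loosely based on the question, and the extra check on `Sid` when building `Block` is my own assumption, not part of the answer above: it only makes a block restart at a session boundary, so that `Block` is meaningful on its own.

```python
import pandas as pd

# Assumption: a reduced sample, not the question's full dataset.
k = pd.DataFrame({
    "Sid":    [2, 2, 2, 6, 6, 6, 6, 6, 6, 6],
    "Itemid": [214662742, 214662742, 214825110,
               214821371, 214821371, 214821371, 214717089,
               214821371, 214821371, 214819762],
})

# A new block starts whenever the item or the session changes.
changed = (k["Itemid"] != k["Itemid"].shift()) | (k["Sid"] != k["Sid"].shift())
k["Block"] = changed.cumsum()

# Length of every individual run of consecutive clicks.
run_lengths = k.groupby(["Sid", "Itemid", "Block"]).size()

# Sum the run lengths per (session, item) to get the m+n total.
z = (run_lengths.groupby(level=["Sid", "Itemid"])
                .sum()
                .reset_index(name="sum_counts"))
print(z)
```

Keeping `run_lengths` as an intermediate result makes it possible to filter runs (for example, to keep only runs of at least two clicks) before summing.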