Python/Pandas: Grouping and counting records by date and ID
I have a relatively large data frame (~10^6 records) in Python, structured as follows:
Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/1/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
7,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/1/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/1/16,X,AR,728,JJ County,5.0,negative,negative,negative
...
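I am trying to group by date (day) and ID, then compute 1) the total number of records for each day and ID, and 2) the total number of "positive" values in column "A" (for example) for each day and ID. Ultimately, I want to populate a data frame that gives the number of positives and the total number of records for each day and ID, e.g.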
Date,ID,Positive,Total
9/1/16,360,10,20
9/2/16,360,12,23
9/2/16,718,2,43
...
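I initially used a double for loop to iterate over every unique date and ID, but that took far too long. I would appreciate a better approach. Thanks in advance for any input!
I took the data you provided and made a small .csv file so you can reproduce this... I also changed a few values to test that it works: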
Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,positive,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,positive,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/2/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/2/16,X,AR,728,JJ County,8.0,positive,negative,negative
7,9/2/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/3/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/3/16,X,AR,728,JJ County,5.0,negative,negative,negative
After reading it in, it looks like this:
>>> import pandas as pd
>>> X = pd.read_csv('data.csv', header=0, index_col=None).drop('Index', axis=1)
>>> print(X)
     Date City State   ID     County   Age         A         B         C
0  9/1/16    X    AL  360  BB County  29.0  negative  positive  positive
1  9/1/16    X    AL  360  BB County   1.0  positive  negative  negative
2  9/1/16    X    AL  360  BB County  10.0  positive  negative  negative
3  9/1/16    X    AL  360  BB County  11.0  negative  negative  negative
4  9/1/16    X    AR  718  LL County  67.0  negative  negative  negative
5  9/2/16    X    AR  728  JJ County   3.0  negative  negative  negative
6  9/2/16    X    AR  728  JJ County   8.0  positive  negative  negative
7  9/2/16    X    AR  728  JJ County   8.0  negative  negative  negative
8  9/3/16    X    AR  728  JJ County  14.0  negative  negative  negative
9  9/3/16    X    AR  728  JJ County   5.0  negative  negative  negative
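One thing to watch: the Date column is read as plain strings here. That is enough to group by day, but the groups then sort as text rather than chronologically. If that matters, something like the following sketch could be used to convert the column first. It assumes the dates are month/day/two-digit-year, which the sample suggests but does not confirm, and the "9/10/16" value is made up purely to show the ordering difference.

```python
import pandas as pd

# Hypothetical values; "9/10/16" is not in the sample data.
dates = pd.Series(["9/1/16", "9/2/16", "9/10/16"])

# Assumed format: month/day/two-digit year.
parsed = pd.to_datetime(dates, format="%m/%d/%y")

string_order = sorted(dates)
date_order = parsed.sort_values().dt.strftime("%Y-%m-%d").tolist()

print(string_order)  # ['9/1/16', '9/10/16', '9/2/16']
print(date_order)    # ['2016-09-01', '2016-09-02', '2016-09-10']
```

With the real data this would be X['Date'] = pd.to_datetime(X['Date'], format='%m/%d/%y') before the groupby.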
Here is the function that is applied to each group in the groupby call:
def _ct_id_pos(grp):
    # returns (number of rows where A is 'positive', total number of rows) for one group
    return grp[grp.A == 'positive'].shape[0], grp.shape[0]
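To see what the function returns before wiring it into groupby, here is a minimal sketch that calls it on a single hand-built group. The small frame is an assumption for illustration: it mirrors only the Date, ID and A values of the four 9/1/16 rows for ID 360 in the test file.

```python
import pandas as pd

def _ct_id_pos(grp):
    # (number of rows where A is 'positive', total number of rows) for one group
    return grp[grp.A == 'positive'].shape[0], grp.shape[0]

# Hand-built stand-in for the 9/1/16, ID 360 group from the test file.
grp = pd.DataFrame({
    "Date": ["9/1/16"] * 4,
    "ID": [360] * 4,
    "A": ["negative", "positive", "positive", "negative"],
})

result = _ct_id_pos(grp)
print(result)  # (2, 4)
```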
This will be a two-step process... With pandas, you can group by multiple columns and apply the function above.
# the following will have the tuple in one column
>>> X_prime = X.groupby(['Date', 'ID']).apply(_ct_id_pos).reset_index()
>>> print(X_prime)
     Date   ID       0
0  9/1/16  360  (2, 4)
1  9/1/16  718  (0, 1)
2  9/2/16  728  (1, 3)
3  9/3/16  728  (0, 2)
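Note that the result of the groupby call gives us a new column containing embedded tuples, so the next step is to split them into their own columns and drop the embedded one: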
>>> X_prime[['Positive', 'Total']] = X_prime[0].apply(pd.Series)
>>> X_prime.drop([0], axis=1, inplace=True)
>>> print(X_prime)
     Date   ID  Positive  Total
0  9/1/16  360         2      4
1  9/1/16  718         0      1
2  9/2/16  728         1      3
3  9/3/16  728         0      2
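On ~10^6 records, calling a Python function per group and then expanding tuples with apply(pd.Series) may itself get slow. As an alternative sketch (not the approach above, and assuming a pandas version with named aggregation, 0.25 or later), the same table can be built in one pass by turning column A into a boolean and aggregating it. The is_pos helper column is a name I made up, and the CSV is inlined so the snippet runs on its own.

```python
import io
import pandas as pd

# Same test data as the small .csv file above, inlined for a standalone run.
csv_text = """Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,positive,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,positive,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/2/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/2/16,X,AR,728,JJ County,8.0,positive,negative,negative
7,9/2/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/3/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/3/16,X,AR,728,JJ County,5.0,negative,negative,negative
"""
X = pd.read_csv(io.StringIO(csv_text)).drop('Index', axis=1)

summary = (
    X.assign(is_pos=X['A'].eq('positive'))      # True where A is 'positive'
     .groupby(['Date', 'ID'], as_index=False)   # keep Date and ID as columns
     .agg(Positive=('is_pos', 'sum'),           # True counts as 1
          Total=('is_pos', 'size'))             # rows per group
)
print(summary)
```

On the test file this should give the same Positive and Total values as the two-step version. I have not timed it on the full data set, so treat the speed-up as something to measure rather than a given.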