Find the categories that frequently occur together based on another column
Suppose I have the following data in a Pandas dataframe:
Paper ID | Author ID |
---|---|
Paper_1 | Author_1 |
Paper_1 | Author_2 |
Paper_2 | Author_2 |
Paper_3 | Author_1 |
Paper_3 | Author_2 |
Paper_3 | Author_3 |
Paper_4 | Author_1 |
Paper_4 | Author_3 |
I need to find the number of non-zero collaborations. So the output should be:
(Author_1,Author_2) --> 2
(Author_1,Author_3) --> 1
Any help or suggestions would be appreciated.
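If the data is fairly small, merging the dataframe with itself on Paper ID generates every pair of authors on each paper, and those pairs can then be collapsed/aggregated: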
# assume df has columns Paper ID, Author ID
df_merged = df.merge(df, on="Paper ID")
# keep only one instance of a collaboration
mask = df_merged["Author ID_x"] > df_merged["Author ID_y"]
# aggregate (note the use of the mask to avoid double-
# counting and self-collaborations as noted in the
# comment by Riccardo Bucco)
counts = (
    df_merged[mask]
    .groupby(["Author ID_x", "Author ID_y"])
    .agg(collaboration_count=("Paper ID", "count"))
)
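To try the merge approach end to end, here is a minimal sketch that builds the sample dataframe from the question and runs the same steps. Two things here are assumptions of this sketch and not part of the answer above: the mask uses `<` instead of `>`, so that each pair prints with the lower author ID first, and `pair_counts` is a hypothetical name for a plain dictionary of the result.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Paper ID": ["Paper_1", "Paper_1", "Paper_2", "Paper_3",
                     "Paper_3", "Paper_3", "Paper_4", "Paper_4"],
        "Author ID": ["Author_1", "Author_2", "Author_2", "Author_1",
                      "Author_2", "Author_3", "Author_1", "Author_3"],
    }
)

# Self-merge: every (author, author) combination within each paper.
# pandas appends the default suffixes _x and _y to the clashing column.
df_merged = df.merge(df, on="Paper ID")

# A strict inequality drops self-pairs and keeps one ordering of each pair.
# Assumption of this sketch: "<" so the lower ID comes first in each pair.
mask = df_merged["Author ID_x"] < df_merged["Author ID_y"]

counts = (
    df_merged[mask]
    .groupby(["Author ID_x", "Author ID_y"])
    .agg(collaboration_count=("Paper ID", "count"))
)

# Hypothetical convenience: a dict keyed by (author, author) tuples.
pair_counts = counts["collaboration_count"].to_dict()
print(pair_counts)
```

On this sample, (Author_1, Author_2) comes out as 2. (Author_1, Author_3) also comes out as 2, because those two authors share both Paper_3 and Paper_4, and (Author_2, Author_3) comes out as 1 from Paper_3. The self-merge produces one row per ordered author pair per paper, so its size grows roughly with the square of the number of authors on a paper, which is why the approach suits fairly small data.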