Find the categories that frequently occur together based on another column

Question

Suppose I have the following data in a Pandas DataFrame:

Paper ID	Author ID
Paper_1	Author_1
Paper_1	Author_2
Paper_2	Author_2
Paper_3	Author_1
Paper_3	Author_2
Paper_3	Author_3
Paper_4	Author_1
Paper_4	Author_3

I need to find the number of non-zero collaborations between authors, so the output should look like this:
(Author_1,Author_2) --> 2
(Author_1,Author_3) --> 1

Any help or suggestions would be greatly appreciated.

Answer 1

If the data is fairly small, then merging the DataFrame with itself on Paper ID will generate author pairs that can be collapsed/aggregated:

# assume df has columns Paper ID, Author ID
df_merged = df.merge(df, on="Paper ID")

# keep only one instance of a collaboration
mask = df_merged["Author ID_x"] > df_merged["Author ID_y"]

# aggregate (note the use of the mask to avoid double-
# counting and self-collaborations as noted in the
# comment by Riccardo Bucco)
counts = (
    df_merged[mask]
    .groupby(["Author ID_x", "Author ID_y"])
    .agg(collaboration_count=("Paper ID", "count"))
)
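
For anyone who wants to try this end to end, here is a minimal, self-contained sketch. It rebuilds the sample data from the question, runs the self-merge shown above, and cross-checks the result against a pure-Python count built with itertools.combinations. The cross-check and the name pair_counts are my own additions for illustration, not part of the original answer.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Paper ID": [
            "Paper_1", "Paper_1", "Paper_2", "Paper_3",
            "Paper_3", "Paper_3", "Paper_4", "Paper_4",
        ],
        "Author ID": [
            "Author_1", "Author_2", "Author_2", "Author_1",
            "Author_2", "Author_3", "Author_1", "Author_3",
        ],
    }
)

# Self-merge on Paper ID: every pair of authors on the same paper
# (including each author paired with themselves, in both orders).
df_merged = df.merge(df, on="Paper ID")

# Keep one ordering of each pair and drop self-pairs.
mask = df_merged["Author ID_x"] > df_merged["Author ID_y"]

counts = (
    df_merged[mask]
    .groupby(["Author ID_x", "Author ID_y"])
    .agg(collaboration_count=("Paper ID", "count"))
)

# Hypothetical cross-check: count unordered author pairs per paper
# without building the full merged frame.
pair_counts = Counter()
for _, authors in df.groupby("Paper ID")["Author ID"]:
    pair_counts.update(combinations(sorted(set(authors)), 2))

print(counts)
print(pair_counts)
```

Because the mask keeps rows where Author ID_x > Author ID_y, the pairs in counts are indexed with the larger ID first, e.g. (Author_2, Author_1). Note that on the sample data Author_1 and Author_3 appear together on both Paper_3 and Paper_4, so that pair is counted twice. The Counter version may be easier on memory when a few papers have very many authors, since the self-merge grows with the square of the number of authors per paper.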

Tags: python, numpy, dataframe, pandas, data-cleaning