Keep rows with list overlaps in a multiindex dataframe

I have a df with two indexes ("id" and "cluster"), and a "team" column whose values are lists, like the df below:

id          cluster          team
1             5            [CS, VS]
              6              [CS]
2             3            [CS, CS]
              1              [VS]
              2              [TD]
3             8          [CS, CS, VS]
              9              [TD]

I want to look at each "id" and check whether there is any overlap between its "team" rows. For example, for the first id, [CS, VS] and [CS] overlap: they share "CS". If an id has no overlap, drop that "id" and all of its rows. So the output should look like this:

id          cluster          team
1             5            [CS, VS]
              6              [CS]

"id" 2 和 3 被丢弃,因为 [CS, CS]、[VS]、[TD] 或 [CS, CS, VS]、[TD] 之间没有重叠。

Thank you for your time and help!

How about this:

import pandas as pd

# Let's show how the data is built
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 3],
    'cluster': [5, 6, 3, 1, 2, 8, 9],
    'team': [["CS", "VS"], ["CS"], ["CS", "CS"], ["VS"], ["TD"], ["CS", "CS", "VS"], ["TD"]]
}).set_index(['id', 'cluster'])

# Find which indices in the DataFrame meet the criteria
overlap_idx = (
    df.explode("team")  # Dealing with list columns is awful, let's make them individual rows
      .reset_index()  # Need to reset index for the drop_duplicates call
      .drop_duplicates()  # We don't want to count duplicate values in the same cluster as an overlap
      .groupby(["id", "team"])  # Now we're checking if multiple clusters have the same team within each id
      .filter(lambda x: any(x.count() > 1))  # Count the number of clusters with same id and team, keep if > 1
      .set_index(["id", "cluster"])  # Put the index back
).index  # Only keep the final, filtered index

df.loc[overlap_idx]  # Select overlap indices from original data
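
Running the last line on the sample data should return just the two rows for id 1 (clusters 5 and 6), which matches the desired output.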

I wrote this code and it does the job correctly. However, it is not very efficient. I would still appreciate any more efficient solution :)

import itertools

# For each id: flatten the per-row team sets; if no team appears in more than
# one cluster, the flattened list has no repeats, so the id is dropped.
for rows in set(df.index.get_level_values(0)):
    all_items = list(df.loc[rows]['team'])
    # De-duplicate within each row first, then chain the rows together
    unique_items_in_each_row = list(itertools.chain.from_iterable(
        set(i) for i in all_items))
    unique_items_all_rows = set(unique_items_in_each_row)
    # Equal lengths means no team is shared across clusters -> no overlap
    if len(unique_items_in_each_row) == len(unique_items_all_rows):
        df.drop([rows], inplace=True)
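
If you want to avoid the Python-level loop entirely, here is a vectorized sketch of the same check, assuming the df built above (the intermediate names clusters_per_team, has_overlap and keep_ids are only illustrative); it keeps an "id" when some team value appears in more than one of its clusters:

# One row per (id, cluster, team); duplicates within a cluster don't matter
# because nunique() counts distinct clusters per (id, team).
exploded = df["team"].explode().reset_index()
clusters_per_team = exploded.groupby(["id", "team"])["cluster"].nunique()
has_overlap = clusters_per_team.gt(1).groupby(level="id").any()
keep_ids = has_overlap[has_overlap].index
result = df.loc[df.index.get_level_values("id").isin(keep_ids)]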