Keep rows with list overlaps in a multiindex dataframe
I have a df with two index levels ("id" and "cluster"). It has a "team" column whose values are lists. The df looks like this:
id  cluster  team
1   5        [CS, VS]
    6        [CS]
2   3        [CS, CS]
    1        [VS]
    2        [TD]
3   8        [CS, CS, VS]
    9        [TD]
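I want to look at each "id" and check whether its "team" rows overlap. For example, in the first id, [CS, VS] and [CS] overlap because they share "CS". If there is no overlap, the "id" and all of its rows should be dropped. So the output should look like this:
id  cluster  team
1   5        [CS, VS]
    6        [CS]
"id" 2 and 3 are dropped because there is no overlap among [CS, CS], [VS], [TD], or between [CS, CS, VS] and [TD].
Thanks for your time and help!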
How about this:
import pandas as pd

# Let's show how the data is built
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 3],
    'cluster': [5, 6, 3, 1, 2, 8, 9],
    'team': [["CS", "VS"], ["CS"], ["CS", "CS"], ["VS"], ["TD"], ["CS", "CS", "VS"], ["TD"]]
}).set_index(['id', 'cluster'])

# Find which indices in the DataFrame meet the criteria
overlap_idx = (
    df.explode("team")  # Dealing with list columns is awful, let's make them individual rows
    .reset_index()  # Need to reset index for the drop_duplicates call
    .drop_duplicates()  # We don't want to count duplicate values in the same cluster as an overlap
    .groupby(["id", "team"])  # Now we're checking if multiple clusters have the same team within each id
    .filter(lambda x: any(x.count() > 1))  # Count the number of clusters with same id and team, keep if > 1
    .set_index(["id", "cluster"])  # Put the index back
).index.unique()  # Only keep the final, filtered index (deduplicated, in case a cluster shares several teams)

df.loc[overlap_idx]  # Select overlap indices from original data
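Note that the chain above selects only the (id, cluster) pairs that actually take part in an overlap. If the intent is to keep every row of an id as soon as any two of its clusters share a team, one possible variant is to collect the qualifying ids first. This is a minimal sketch under that assumption; the names `clusters_per_team`, `overlap_ids` and `result` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 3],
    'cluster': [5, 6, 3, 1, 2, 8, 9],
    'team': [["CS", "VS"], ["CS"], ["CS", "CS"], ["VS"], ["TD"], ["CS", "CS", "VS"], ["TD"]]
}).set_index(['id', 'cluster'])

# One row per list element, with id and cluster as ordinary columns
exploded = df.explode("team").reset_index()

# For each (id, team), count how many distinct clusters contain that team;
# nunique ignores repeats of a team inside the same cluster
clusters_per_team = exploded.groupby(["id", "team"])["cluster"].nunique()

# ids in which at least one team appears in more than one cluster
overlap_ids = clusters_per_team[clusters_per_team > 1].index.get_level_values("id").unique()

# Keep every row that belongs to one of those ids
result = df[df.index.get_level_values("id").isin(overlap_ids)]
print(result)
```

On the sample data both approaches should select the same two rows; they differ only when an id has some clusters that overlap and others that do not.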
I wrote this code and it does the job correctly. However, it is not very efficient. I would still appreciate any more efficient solution :)
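import itertools

for rows in set(df.index.get_level_values(0)):
    # All team lists belonging to this id
    all_items = list(df.loc[rows]['team'])
    # Deduplicate within each row, then concatenate across rows
    unique_items_in_each_row = list(itertools.chain.from_iterable(set(i) for i in all_items))
    unique_items_all_rows = set(unique_items_in_each_row)
    # Equal lengths mean no team appears in more than one row, so drop this id
    if len(unique_items_in_each_row) == len(unique_items_all_rows):
        df.drop([rows], level=0, inplace=True)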
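The same per-id check can also be expressed with `groupby(...).filter(...)`, which avoids calling `drop` repeatedly and leaves the original df untouched. This is only a sketch and has not been benchmarked here; the helper `has_overlap` is a hypothetical name, not something from the original post.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 3],
    'cluster': [5, 6, 3, 1, 2, 8, 9],
    'team': [["CS", "VS"], ["CS"], ["CS", "CS"], ["VS"], ["TD"], ["CS", "CS", "VS"], ["TD"]]
}).set_index(['id', 'cluster'])


def has_overlap(team_lists):
    """Return True if any team occurs in more than one of the given lists."""
    # set(t) removes repeats within a row, so each row counts a team at most once
    counts = Counter(chain.from_iterable(set(t) for t in team_lists))
    return any(n > 1 for n in counts.values())


# Keep only the ids whose rows share at least one team
result = df.groupby(level="id").filter(lambda g: has_overlap(g["team"]))
print(result)
```

Because `filter` returns a new DataFrame, assign it back to `df` if the in-place behavior of the loop above is wanted.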