Drop same values in different columns by pair (drop connected components)

Question

After applying the Levenshtein distance algorithm, I get a DataFrame like this:

Elemento_lista	Item_ID	Score	idx	ITEM_ID_Coincidencia
4	691776	100	5	691777
4	691776	100	6	691789
4	691776	100	7	691791
5	691777	100	4	691776
5	691777	100	6	691789
5	691777	100	7	691791
6	691789	100	4	691776
6	691789	100	5	691777
6	691789	100	7	691791
7	691791	100	4	691776
7	691791	100	5	691777
7	691791	100	6	691789
9	1407402	100	10	1407424
10	1407424	100	9	1407402

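The Elemento_lista column is the index of the element being compared with the others, Item_ID is that element's id, Score is the score produced by the algorithm, idx is the index of the element found to be similar (the same kind of index as Elemento_lista, but for the matched element), and ITEM_ID_Coincidencia is the id of the matched element.

This is a small sample of the real DataFrame (more than 300,000 rows). I need to drop the rows that repeat the same information. For example, if Elemento_lista 4 matches idx 5, 6 and 7, they are all the same item, so I don't need the rows where 5 matches 4, 6 and 7, where 6 matches 4, 5 and 7, or where 7 matches 4, 5 and 6. The same goes for every Elemento_lista: value 9 matches idx 10, so I don't need the row where Elemento_lista 10 matches idx 9. How can I drop these rows to reduce the length of the DataFrame?

The final DataFrame should be: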

Elemento_lista	Item_ID	Score	idx	ITEM_ID_Coincidencia
4	691776	100	5	691777
4	691776	100	6	691789
4	691776	100	7	691791
9	1407402	100	10	1407424

I don't know how to do this. Is it possible?

Thanks in advance.

Answer 1

Preparing the data as an example:

import pandas as pd

a = [
[4,691776,100,5,691777],
[4,691776,100,6,691789],
[4,691776,100,7,691791],
[5,691777,100,4,691776],
[5,691777,100,6,691789],
[5,691777,100,7,691791],
[6,691789,100,4,691776],
[6,691789,100,5,691777],
[6,691789,100,7,691791],
[7,691791,100,4,691776],
[7,691791,100,5,691777],
[7,691791,100,6,691789],
[9,1407402,100,10,1407424],
[10,1407424,100,9,1407402]
]
c = ['Elemento_lista', 'Item_ID', 'Score', 'idx', 'ITEM_ID_Coincidencia']
df = pd.DataFrame(data = a, columns = c)
df

Now insert a column that holds, for each row, the sorted pair of the two indexes.

tuples_of_indexes = [sorted([x[0], x[3]]) for x in df.values]
df.insert(5, 'tuple_of_indexes', (tuples_of_indexes))

Then sort the whole DataFrame by the inserted column:

df = df.sort_values(by=['tuple_of_indexes'])

Then drop the rows whose value in the inserted column is duplicated:

df = df[~df['tuple_of_indexes'].apply(tuple).duplicated()]

Finally, drop the inserted column 'tuple_of_indexes' (assigning the result, since drop returns a new DataFrame):

df = df.drop(columns=['tuple_of_indexes'])
df

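The output keeps one row per unordered pair of indexes, so mirrored rows such as 5/4 are gone while rows such as 5/6 remain: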

    Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
0                4   691776    100    5                691777
1                4   691776    100    6                691789
2                4   691776    100    7                691791
4                5   691777    100    6                691789
5                5   691777    100    7                691791
8                6   691789    100    7                691791
12               9  1407402    100   10               1407424
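The same pair-wise de-duplication can be sketched in a vectorised form, assuming NumPy is available. The helper name drop_mirrored_pairs and its parameters are illustrative and not part of the answer above. It sorts the two index columns row by row and flags repeated pairs, so no temporary column has to be inserted and removed.

```python
import numpy as np
import pandas as pd


def drop_mirrored_pairs(frame, left="Elemento_lista", right="idx"):
    """Keep the first row of every unordered (left, right) pair.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Sort each (left, right) pair so that (5, 4) becomes (4, 5).
    pairs = np.sort(frame[[left, right]].to_numpy(), axis=1)
    keys = pd.DataFrame(pairs, index=frame.index)
    # duplicated() marks every occurrence of a pair after the first one.
    return frame[~keys.duplicated()]


# Only the two index columns of the sample are needed for the illustration.
sample_pairs = [
    (4, 5), (4, 6), (4, 7),
    (5, 4), (5, 6), (5, 7),
    (6, 4), (6, 5), (6, 7),
    (7, 4), (7, 5), (7, 6),
    (9, 10), (10, 9),
]
sample = pd.DataFrame(sample_pairs, columns=["Elemento_lista", "idx"])
deduplicated = drop_mirrored_pairs(sample)
print(deduplicated)
```

Like the step-by-step version, this only collapses mirrored pairs; it does not reduce each group of similar items to a single representative.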

Answer 2

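This can be solved with graph theory.

You have the following relationships between your IDs (in the sample, 4, 5, 6 and 7 are all linked to one another, and 9 is linked to 10):

So all you need to do is find the connected subgraphs.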
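To make the idea concrete, finding the subgraphs can be sketched without a graph library using a union-find (disjoint-set) structure. This is a swapped-in illustration rather than the method used in this answer, and the function name component_representatives is hypothetical. Each node is mapped to the smallest node of its group, and only the rows whose Elemento_lista is that representative are kept.

```python
import pandas as pd


def component_representatives(edges):
    """Map every node to the smallest node of its connected component.

    Hypothetical helper built on a plain union-find; not a library API.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so the root
            # of every component is always its smallest node.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {node: find(node) for node in list(parent)}


edge_list = [
    (4, 5), (4, 6), (4, 7),
    (5, 4), (5, 6), (5, 7),
    (6, 4), (6, 5), (6, 7),
    (7, 4), (7, 5), (7, 6),
    (9, 10), (10, 9),
]
links = pd.DataFrame(edge_list, columns=["Elemento_lista", "idx"])

representative = component_representatives(
    zip(links["Elemento_lista"].tolist(), links["idx"].tolist())
)
# Keep only the rows that start from the representative of their group.
kept = links[links["Elemento_lista"].map(representative) == links["Elemento_lista"]]
print(kept)
```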

For this we can use networkx's connected_components function:

# pip install networkx
import networkx as nx
G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx')

# get "first" (arbitrary) node for each subgraph
# note that sets (unsorted) are used
# so there is no guarantee on any node being "first" item
nodes = [tuple(g)[0] for g in nx.connected_components(G) if g]
# [4, 9]

# filter DataFrame
df2 = df[df['Elemento_lista'].isin(nodes)]

Output:

    Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
0                4   691776    100    5                691777
1                4   691776    100    6                691789
2                4   691776    100    7                691791
12               9  1407402    100   10               1407424
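Because the components are sets, tuple(g)[0] picks an arbitrary node. If a reproducible result matters, one option is to take the smallest node of each component instead. The sketch below assumes networkx is installed and rebuilds only the two index columns of the sample; the variable names are illustrative.

```python
import networkx as nx
import pandas as pd

index_pairs = [
    (4, 5), (4, 6), (4, 7),
    (5, 4), (5, 6), (5, 7),
    (6, 4), (6, 5), (6, 7),
    (7, 4), (7, 5), (7, 6),
    (9, 10), (10, 9),
]
pairs_df = pd.DataFrame(index_pairs, columns=["Elemento_lista", "idx"])

graph = nx.from_pandas_edgelist(pairs_df, source="Elemento_lista", target="idx")

# min() gives a deterministic representative for every component,
# unlike tuple(component)[0], which depends on set iteration order.
representatives = sorted(min(component) for component in nx.connected_components(graph))

filtered = pairs_df[pairs_df["Elemento_lista"].isin(representatives)]
print(representatives)
print(filtered)
```

On the sample this selects nodes 4 and 9, which are the rows the question asks to keep.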

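Update: real data

Your real data is highly connected and, treated as an undirected graph, forms only 2 groups.

You can change strategy here and use a directed graph with strongly_connected_components: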

import networkx as nx
#df = pd.read_csv('ADIDAS_CALZADO.csv', index_col=0)
G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx', create_using=nx.DiGraph)

# len(list(nx.strongly_connected_components(G)))
# 150 subgraphs

nodes = [tuple(g)[0] for g in nx.strongly_connected_components(G) if g]

df2 = df[df['Elemento_lista'].isin(nodes)]

# len(df2)
# only 2,910 nodes left out of the 25,371 initial ones
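Why the directed version yields more groups can be seen on a toy graph. The edges below are made up for illustration and are not taken from the real data; the assumption is that some matches in the real data appear in one direction only. In an undirected graph a single one-way match is enough to merge two groups, while a strongly connected component requires a path in both directions.

```python
import networkx as nx

# Hypothetical edges: 1 and 2 match each other, but 2 -> 3 has no reverse row.
toy_edges = [(1, 2), (2, 1), (2, 3)]

undirected = nx.Graph(toy_edges)
directed = nx.DiGraph(toy_edges)

# Sort the components so that the result does not depend on set ordering.
weak_groups = sorted(sorted(c) for c in nx.connected_components(undirected))
strong_groups = sorted(sorted(c) for c in nx.strongly_connected_components(directed))

print(weak_groups)    # one group containing all three nodes
print(strong_groups)  # node 3 is split off into its own group
```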

[Figure: graph of the filtered DataFrame df2]
