Finding duplicate sets of records sharing the same many-to-many relations

Question

I am using Pandas to preprocess a CSV dataset and convert it into a SQLite database.

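I have a many-to-many relationship between two entities A and B, represented by a junction DataFrame with A2B.columns == ['AId', 'BId']. The uniqueness constraint on the As is that each A must have a distinct set of relations to Bs.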

I want to efficiently remove the duplicate As according to this constraint. This is how I do it with Pandas:

AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index

The conversion to tuples makes it possible to compare the sets of BIds associated with each AId.
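One detail the tuple comparison depends on is row order: (1, 2, 3) and (3, 2, 1) are different tuples even though they describe the same set of BIds. The sketch below is a hedged variant of the one-liner above that sorts and deduplicates each group before comparing; the sample data is made up for illustration, with AId 2 deliberately listed in reverse order.

```python
import pandas as pd

# Illustrative data (an assumption): AId 1 and AId 2 link to the same BIds,
# but the rows for AId 2 appear in a different order.
A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 3, 2, 1, 1, 2, 3, 4],
})

# Build an order-insensitive signature per AId: a sorted tuple of distinct BIds.
signatures = A2B.groupby("AId")["BId"].apply(lambda s: tuple(sorted(set(s))))

# Keep the first AId for each distinct signature.
AId_dedup = signatures.drop_duplicates().index
print(list(AId_dedup))
```

With a plain apply(tuple), AId 1 and AId 2 would not be recognised as duplicates on this input, because their tuples differ in order.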

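The relation A2B can be seen as a (sparse boolean) matrix that holds a 1 wherever a link exists between an A and a B. I want to drop the duplicate rows of this matrix; alas, pd.unstack() cannot produce a sparse matrix. (An efficient row hash would also be needed.)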

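My questions are:

What is the operation I am trying to perform, in relational algebra terms?
Can it be done more efficiently with Pandas, or with SQL (preferably using the SQLite engine)?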

I want to use this operation to find synonyms (duplicate objects) in a biological network, where interactions are represented as tables.

Edit: here is an example of what I want:

+-----+-----+
| Aid | Bid |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   2 |   1 |
|   2 |   2 |
|   2 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+

A2B = A2B.groupby('AId').BId.apply(tuple)
+-----+-----------+
| Aid |    Bid    |
+-----+-----------+
|   1 | (1,2,3)   |
|   2 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+

A2B = A2B.drop_duplicates()
+-----+-----------+
| Aid |    Bid    |
+-----+-----------+
|   1 | (1,2,3)   |
|   3 | (1,2,3,4) |
+-----+-----------+

Back to the junction table (not that easy in Pandas):

+-----+-----+
| Aid | Bid |
+-----+-----+
|   1 |   1 |
|   1 |   2 |
|   1 |   3 |
|   3 |   1 |
|   3 |   2 |
|   3 |   3 |
|   3 |   4 |
+-----+-----+
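For the last step, one possible way to get back to the junction table is to filter the original rows by the surviving AIds with isin. This is only a sketch under the assumption that keeping the first AId of each duplicate group is acceptable; A2B_dedup is a name introduced here for illustration.

```python
import pandas as pd

# The example junction table from above.
A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
})

# AIds that survive deduplication (first AId per distinct tuple of BIds).
AId_dedup = A2B.groupby("AId")["BId"].apply(tuple).drop_duplicates().index

# Keep only the junction rows whose AId survived.
A2B_dedup = A2B[A2B["AId"].isin(AId_dedup)].reset_index(drop=True)
print(A2B_dedup)
```

This avoids exploding the tuples back into rows: the original rows are reused and only filtered.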

Answer 1

If you can recreate the A2B table, create a new table with a unique constraint on (AId, BId), then insert the data like this:

insert into new_A2B select distinct AId, BId from A2B;

Then perform new inserts using SQLite's ON CONFLICT handling, like this:

insert or ignore into new_A2B values (aid, bid);
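The two statements above can be tried out from Python with the standard sqlite3 module. This is a minimal sketch on an in-memory database; the column types and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original junction table, containing a repeated (AId, BId) row.
conn.execute("CREATE TABLE A2B (AId INTEGER, BId INTEGER)")
conn.executemany("INSERT INTO A2B VALUES (?, ?)", [(1, 1), (1, 1), (1, 2), (2, 1)])

# New table with a unique constraint on the pair, filled with distinct rows.
conn.execute("CREATE TABLE new_A2B (AId INTEGER, BId INTEGER, UNIQUE (AId, BId))")
conn.execute("INSERT INTO new_A2B SELECT DISTINCT AId, BId FROM A2B")

# Later inserts: an existing pair is silently skipped, a new pair is stored.
conn.execute("INSERT OR IGNORE INTO new_A2B VALUES (?, ?)", (1, 2))
conn.execute("INSERT OR IGNORE INTO new_A2B VALUES (?, ?)", (2, 2))

rows = conn.execute("SELECT AId, BId FROM new_A2B ORDER BY AId, BId").fetchall()
print(rows)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Note that this only removes repeated (AId, BId) rows; it does not by itself detect two different AIds that share the same set of BIds.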

If you cannot recreate your A2B table, use DISTINCT when you select rows from it.
Edit:
You can find the duplicate ids with this query:

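select A2B.aid, dup.aid
from A2B
left join A2B as dup on dup.bid = A2B.bid
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)
-- also compare against the left side's own row count, so that a strict
-- superset of dup's bids is not reported as a duplicate
and count(A2B.bid) = (select count(bid) from A2B as a where a.aid = A2B.aid)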

If you need to, you can add a WHERE condition so that each duplicate pair is reported only once, with the lower id first:

where A2B.aid < dup.aid
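To try the pair-finding idea end to end, here is a hedged sketch using Python's sqlite3 module. It is a restatement of the approach rather than the exact query above: two aids are reported when the number of bids they share equals the row count of both sides, which amounts to a set-equality test between groups (closely related to relational division). The aliases x, y, cx, cy and the extra aid 4, a strict subset of aid 1, are assumptions added for illustration, and (aid, bid) pairs are assumed to be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A2B (aid INTEGER, bid INTEGER, UNIQUE (aid, bid))")
conn.executemany("INSERT INTO A2B VALUES (?, ?)", [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2), (2, 3),
    (3, 1), (3, 2), (3, 3), (3, 4),
    (4, 1), (4, 2),            # strict subset of aid 1: must not match
])

query = """
WITH c AS (SELECT aid, COUNT(*) AS n FROM A2B GROUP BY aid)
SELECT x.aid, y.aid
FROM A2B AS x
JOIN A2B AS y ON y.bid = x.bid AND x.aid < y.aid   -- shared bids, each pair once
JOIN c AS cx ON cx.aid = x.aid
JOIN c AS cy ON cy.aid = y.aid
WHERE cx.n = cy.n                                   -- same number of bids
GROUP BY x.aid, y.aid, cx.n
HAVING COUNT(*) = cx.n                              -- and all of them shared
ORDER BY x.aid, y.aid
"""
duplicate_pairs = conn.execute(query).fetchall()
print(duplicate_pairs)  # [(1, 2)]
```

Comparing the counts on both sides is what keeps subsets and supersets (aid 3 and aid 4 here) out of the result.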

Maybe this query will be faster:

with
  c as (select aid, count(1) as c
    from A2B
    group by aid) 
select A2B.aid, dup.aid
from A2B
inner join c as ac on ac.aid = A2B.aid
left join A2B as dup on dup.bid = A2B.bid and A2B.aid < dup.aid 
and exists(select 1 from c where aid = dup.aid and c = ac.c)
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)

Edit: you can test another solution (this may be the fastest query):

with
  c as (select aid, min(bid) as f, max(bid) as l, count(1) as c
    --, sum(bid) as s
    from A2B
    group by aid)
select f.aid, dup.aid
from c as f inner join c as dup
on f.aid < dup.aid and f.f = dup.f and f.l = dup.l and f.c = dup.c
--and f.s = dup.s
where f.c = (
  select count(1)
  from A2B as t1
  inner join A2B as t2
  on t1.aid < t2.aid and t1.bid = t2.bid and t1.aid = f.aid and t2.aid = dup.aid)

You can also try uncommenting the two commented-out lines, "sum(bid) as s" and "and f.s = dup.s", to make the prefilter stricter.
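The idea behind this last query is a cheap prefilter followed by an exact check: only aids whose min, max, count (and optionally sum) of bids agree are compared row by row. Below is a sketch of that idea run through Python's sqlite3 module with the sum check enabled. The column names lo, hi, n, s and the sample data are assumptions chosen for illustration; aids 1 and 2 agree on min, max and count but differ in sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A2B (aid INTEGER, bid INTEGER, UNIQUE (aid, bid))")
conn.executemany("INSERT INTO A2B VALUES (?, ?)", [
    (1, 1), (1, 2), (1, 4),
    (2, 1), (2, 3), (2, 4),   # same min/max/count as aid 1, different sum
    (3, 1), (3, 2), (3, 4),   # true duplicate of aid 1
])

query = """
WITH c AS (
  SELECT aid, MIN(bid) AS lo, MAX(bid) AS hi, COUNT(*) AS n, SUM(bid) AS s
  FROM A2B
  GROUP BY aid)
SELECT f.aid, dup.aid
FROM c AS f
JOIN c AS dup
  ON f.aid < dup.aid
 AND f.lo = dup.lo AND f.hi = dup.hi AND f.n = dup.n AND f.s = dup.s  -- prefilter
WHERE f.n = (                                                         -- exact check
  SELECT COUNT(*)
  FROM A2B AS t1
  JOIN A2B AS t2 ON t1.bid = t2.bid
  WHERE t1.aid = f.aid AND t2.aid = dup.aid)
ORDER BY f.aid, dup.aid
"""
pairs = conn.execute(query).fetchall()
print(pairs)  # [(1, 3)]
```

The aggregates act like a weak hash of each row set: they can rule pairs out cheaply, but matching aggregates do not prove equality, which is why the correlated count is still needed.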
