Python Pandas: Join two tables keeping no duplicates but also not changing the first table

Question

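I need to:

Join table1 and table2
Eliminate duplicates
Keep table1 exactly as it is
Produce a dictionary that maps each old id (and the table it came from) to its new id

An example with the expected output follows.

PS: table1 comes from a database that is already in production, and its IDs are referenced by many other tables, so I cannot change anything that already exists. I can only add rows that are not there yet. I also need to record which new ID each row ends up with.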

table1

id   name        birthdate     
1    Goku        1997-12-15 
2    Freeza      2000-10-03
3    Vegeta      2003-08-19

table2

id    name        birthdate
1     Krillin     1983-02-28
2     Roshi       1960-06-07
3     Goku        1997-12-15
4     Freeza      1998-10-10

So from these I need to generate the following:

resulting_table1

id    name        birthdate     
1     Goku        1997-12-15 
2     Freeza      2000-10-03
3     Vegeta      2003-08-19
4     Krillin     1983-02-28
5     Roshi       1960-06-07
6     Freeza      1998-10-10

I also need a table that maps each old id and its source table to the new id, something like this:

from_to_table

id   origin      new_id
1    table_1     1
2    table_1     2
3    table_1     3
1    table_2     4
2    table_2     5
3    table_2     1
4    table_2     6

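I have tried many approaches. The only one that works so far is inserting row by row and checking both fields each time, but that is far too slow to be practical.

The best approach I have found so far is: concatenate the two tables -> group the data and generate a new id column -> join the grouped table with the concatenated table to build from_to_table. The problem is that this changes IDs that I must not change, and I don't know how to preserve them.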

Answer 1

I'm assuming id is a column, not the index:

table1 =
   id    name   birthdate
0   1    Goku  1997-12-15
1   2  Freeza  2000-10-03
2   3  Vegeta  2003-08-19
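
To follow along, the sample tables from the question can be rebuilt as DataFrames. This is only a minimal sketch: it assumes birthdate is stored as a plain string, which the question does not state.

```python
import pandas as pd

# Sample data from the question; birthdate is kept as a plain string here
# (an assumption, the real column might be a datetime).
table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})
table2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Krillin", "Roshi", "Goku", "Freeza"],
    "birthdate": ["1983-02-28", "1960-06-07", "1997-12-15", "1998-10-10"],
})

# If id were the index rather than a column, reset_index() would turn it
# back into a column, e.g. table1 = table1.reset_index()
print(table1)
```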

Then you could try the following:

(1) Create a concatenated table_tmp with an extra column that records the source table:

table_tmp = pd.concat([table1.assign(table=1), table2.assign(table=2)])

   id     name   birthdate  table
0   1     Goku  1997-12-15      1
1   2   Freeza  2000-10-03      1
2   3   Vegeta  2003-08-19      1
0   1  Krillin  1983-02-28      2
1   2    Roshi  1960-06-07      2
2   3     Goku  1997-12-15      2
3   4   Freeza  1998-10-10      2

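(2) Build resulting_table1 from it. Because table1's rows come first and drop_duplicates keeps the first occurrence, table1's rows stay at the top, so renumbering from 1 reproduces their ids as long as they already run 1, 2, 3, ... without gaps: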

resulting_table1 = (
    table_tmp
    .drop_duplicates(["name", "birthdate"])
    .reset_index(drop=True)
    .assign(id=lambda df: df.index + 1)
    .drop(columns="table")
)

   id     name   birthdate
0   1     Goku  1997-12-15
1   2   Freeza  2000-10-03
2   3   Vegeta  2003-08-19
3   4  Krillin  1983-02-28
4   5    Roshi  1960-06-07
5   6   Freeza  1998-10-10
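
Step (2) rebuilds the id column from row positions, which matches table1 only when its ids are consecutive and start at 1. If the production ids have gaps, one possible variant is to leave table1 alone and number only the new rows, starting after table1's largest id. The sketch below assumes table1 is non-empty and that (name, birthdate) identifies a row; append_new_rows is a hypothetical helper name, not something from the answer.

```python
import pandas as pd

def append_new_rows(table1, table2, keys=("name", "birthdate")):
    """Return table1 plus the rows of table2 whose keys are not in table1.

    table1's ids are left untouched; new rows get ids after table1's max id.
    Assumes table1 is non-empty and has an integer 'id' column.
    """
    keys = list(keys)
    # Unique key combinations in table2, in their original order.
    candidates = table2[keys].drop_duplicates()
    # Anti-join: a left merge with indicator marks keys missing from table1.
    marked = candidates.merge(
        table1[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    new_rows = marked.loc[marked["_merge"] == "left_only", keys].reset_index(drop=True)
    # Number the new rows starting right after the largest existing id.
    start = int(table1["id"].max()) + 1
    new_rows.insert(0, "id", list(range(start, start + len(new_rows))))
    return pd.concat([table1, new_rows], ignore_index=True)

# Same sample rows as the question, but with non-consecutive ids in table1.
table1 = pd.DataFrame({
    "id": [10, 20, 35],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})
table2 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Krillin", "Roshi", "Goku", "Freeza"],
    "birthdate": ["1983-02-28", "1960-06-07", "1997-12-15", "1998-10-10"],
})

result = append_new_rows(table1, table2)
print(result)  # ids 10, 20, 35 are kept; the new rows get 36, 37, 38
```

The resulting table can then be used in step (3) in place of resulting_table1, since that step only merges on name and birthdate.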

(3) Then use both to create from_to_table:

from_to_table = (
    table_tmp
    .merge(resulting_table1, on=["name", "birthdate"], how="left")
    .drop(columns=["name", "birthdate"])
    .rename(columns={"id_x": "id", "id_y": "id_new"})
)

   id  table  id_new
0   1      1       1
1   2      1       2
2   3      1       3
3   1      2       4
4   2      2       5
5   3      2       1
6   4      2       6
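
The question also asks for a dictionary. One way to get it is to read the mapping off from_to_table, keyed by source table and old id. This is a sketch only: the table is typed in by hand so the snippet runs on its own, and id_map is a hypothetical name.

```python
import pandas as pd

# from_to_table as printed in step (3), typed in by hand for this sketch.
from_to_table = pd.DataFrame({
    "id":     [1, 2, 3, 1, 2, 3, 4],
    "table":  [1, 1, 1, 2, 2, 2, 2],
    "id_new": [1, 2, 3, 4, 5, 1, 6],
})

# Key: (source table, old id); value: new id.
rows = from_to_table[["table", "id", "id_new"]].itertuples(index=False, name=None)
id_map = {(table, old_id): new_id for table, old_id, new_id in rows}

print(id_map[(2, 3)])  # Goku has id 3 in table2 and keeps id 1 from table1
```

The source table has to be part of the key because the same old id (for example 1) occurs in both tables and maps to different new ids.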

Answer 2

For resulting_table1, I suggest using merge to do an outer join on the name and birthdate columns and then re-creating the id column:

resulting_table1 = pd.merge(table1, table2, on=['name','birthdate'], how='outer')[['name','birthdate']]
resulting_table1['id'] = range(1, len(resulting_table1)+1)

For from_to_table, you can use another outer join (this time on all columns) with the indicator flag to keep track of the source table:

from_to_table = pd.merge(table1, table2, how='outer', indicator='origin').replace({'origin':{'left_only':'table_1', 'right_only':'table_2'}})

Finally, left-join resulting_table1 to pick up the new ids:

from_to_table = from_to_table.merge(resulting_table1, on=['name','birthdate'], how="left")
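
Because the id column is re-created from row positions in this approach, it may be worth checking that table1's ids came through unchanged. The pandas documentation describes an outer merge as sorting the join keys lexicographically, so the row order after the merge is not guaranteed to match table1. Below is a minimal sketch of such a check; ids_preserved is a hypothetical helper, not a pandas function, and the two result tables are made up for illustration.

```python
import pandas as pd

def ids_preserved(original, result, keys=("name", "birthdate")):
    """Return True if every row of `original` has the same id in `result`."""
    keys = list(keys)
    # A row missing from `result` gets NaN as id_new and therefore fails.
    joined = original.merge(result, on=keys, how="left", suffixes=("_old", "_new"))
    return bool((joined["id_old"] == joined["id_new"]).all())

table1 = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Goku", "Freeza", "Vegeta"],
    "birthdate": ["1997-12-15", "2000-10-03", "2003-08-19"],
})

# A result that only appends a row after table1's rows.
good = pd.concat(
    [table1, pd.DataFrame({"id": [4], "name": ["Krillin"], "birthdate": ["1983-02-28"]})],
    ignore_index=True,
)
# A result that was sorted by name and then renumbered.
bad = (
    good.sort_values("name")
    .reset_index(drop=True)
    .assign(id=lambda df: df.index + 1)
)

print(ids_preserved(table1, good))  # True
print(ids_preserved(table1, bad))   # False
```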
