Pandas merge tables: only distinct IDs from second table

Question

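I'm trying to figure out whether it is possible to join/merge/concat two tables so that, instead of an 'outer' join, I keep everything from the first table and add only the rows from the second table whose IDs are not already present, using pandas' built-in options.

This is what I'm doing right now, and the code doesn't feel very elegant: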

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['c', '8', '1']]
b = [['a', '52', '49'], ['b', '23', '0.05'], ['x', '5', '0']]
df1 = pd.DataFrame(a, columns=['id_col', 'two', 'three'])
df2 = pd.DataFrame(b, columns=['id_col', 'two', 'three'])


# remove df2 entries also in df1
different_ids = set(df2.id_col).difference(set(df1.id_col))
df2 = df2[df2.id_col.isin(different_ids)]
# merge data frames
df_merged = pd.concat([df1,df2])

The merged df should contain entries a, b, c from df1 and x from df2.

Answer 1

I think you can do all of this in one step: use isin to select the rows of df2 whose id_col is not in df1.id_col, then concat df1 with that subset:

res = pd.concat([df1, df2[~df2.id_col.isin(df1.id_col)]])

In [186]: res
Out[186]:
  id_col  two three
0      a  1.2   4.2
1      b   70  0.03
2      c    8     1
2      x    5     0
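
Note that the result above carries the index label 2 twice, because concat keeps each frame's original index. A minimal sketch of the same isin approach, assuming a fresh 0..n-1 index is wanted (the name new_rows is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame(
    [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['c', '8', '1']],
    columns=['id_col', 'two', 'three'],
)
df2 = pd.DataFrame(
    [['a', '52', '49'], ['b', '23', '0.05'], ['x', '5', '0']],
    columns=['id_col', 'two', 'three'],
)

# Keep only the df2 rows whose id_col does not appear in df1
new_rows = df2[~df2['id_col'].isin(df1['id_col'])]

# ignore_index=True renumbers the rows instead of keeping the duplicate label 2
res = pd.concat([df1, new_rows], ignore_index=True)
print(res)
```

If the original index labels matter downstream, leave ignore_index out and keep the behaviour shown in the output above.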

Timing:

In [23]: %timeit pd.concat((df1, df2)).drop_duplicates('id_col')
100 loops, best of 3: 1.95 ms per loop

In [24]: %timeit pd.concat([df1, df2[~df2.id_col.isin(df1.id_col)]])
100 loops, best of 3: 1.79 ms per loop

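Judging by these timings, the isin approach is slightly faster (1.79 ms vs. 1.95 ms per loop on this small example).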
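
The frames timed above have only three rows each, so fixed overhead probably dominates the measurement. A rough sketch for repeating the comparison on larger frames, using the standard timeit module; the size n, the helper names by_isin and by_drop, and the generated IDs are all assumptions made for illustration, and no particular result is implied:

```python
import timeit

import pandas as pd

n = 10_000  # assumed size, adjust to match your data
df1 = pd.DataFrame({'id_col': [str(i) for i in range(n)], 'two': range(n)})
# Half of the df2 IDs overlap with df1, the other half are new
df2 = pd.DataFrame({'id_col': [str(i) for i in range(n // 2, n // 2 + n)], 'two': range(n)})


def by_isin(left, right):
    """Append only the rows of right whose id_col is absent from left."""
    return pd.concat([left, right[~right['id_col'].isin(left['id_col'])]], ignore_index=True)


def by_drop(left, right):
    """Concatenate everything, then keep the first row seen for each id_col."""
    return pd.concat([left, right], ignore_index=True).drop_duplicates('id_col').reset_index(drop=True)


print('isin           :', timeit.timeit(lambda: by_isin(df1, df2), number=10))
print('drop_duplicates:', timeit.timeit(lambda: by_drop(df1, df2), number=10))
```

Which variant wins may depend on the frame sizes, the dtype of the ID column, and the pandas version, so it is worth measuring on your own data.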

Answer 2

You can concat df1 and df2 and then call drop_duplicates on the id_col column.

>>> df = pd.concat((df1, df2))
>>> print(df.drop_duplicates('id_col'))
  id_col  two three
0      a  1.2   4.2
1      b   70  0.03
2      c    8     1
2      x    5     0
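
One thing to keep in mind: drop_duplicates keeps the first row per id_col by default, so df1 has to come first in the concat, and any ID repeated within a single frame is collapsed as well. A small sketch of that difference from the isin approach, using a hypothetical df2 in which the new ID 'x' occurs twice:

```python
import pandas as pd

df1 = pd.DataFrame({'id_col': ['a', 'b'], 'two': ['1', '2']})
# Hypothetical df2 in which the new id 'x' occurs twice
df2 = pd.DataFrame({'id_col': ['a', 'x', 'x'], 'two': ['9', '5', '6']})

# concat + drop_duplicates: one row per id, the first occurrence wins
via_drop = pd.concat([df1, df2], ignore_index=True).drop_duplicates('id_col', keep='first')

# isin filter: only the df2 rows that clash with df1 are removed
via_isin = pd.concat([df1, df2[~df2['id_col'].isin(df1['id_col'])]], ignore_index=True)

print(via_drop['id_col'].tolist())  # ['a', 'b', 'x']
print(via_isin['id_col'].tolist())  # ['a', 'b', 'x', 'x']
```

With unique IDs inside each frame, as in the question, both approaches give the same rows.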

