Outer-only join in Python pandas
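I have two DataFrames with the same column names. Some rows match between them and some are unique to each.
I want to exclude the middle part and keep only the rows that are unique to each DataFrame.
How would I concat, merge, or join these two DataFrames to get that?
For example, in this diagram I don't want the middle, only the two sides:
Here is my current code: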
def query_to_df(query):
...
df_a = pd.DataFrame(data_a)
df_b = pd.DataFrame(data_b)
outer_results = pd.concat([df_a, df_b], axis=1, join='outer')
return df
Let me give an example of what I need:
df_a =
col_a col_b col_c
a1 b1 c1
a2 b2 c2
df_b =
col_a col_b col_c
a2 b2 c2
a3 b3 c3
# they only share the 2nd row: a2 b2 c2
# so the outer result should be:
col_a col_b col_c col_a col_b col_c
a1 b1 c1 NA NA NA
NA NA NA a3 b3 c3
Or I would be just as happy with two DataFrames:
result_1 =
col_a col_b col_c
a1 b1 c1
result_2 =
col_a col_b col_c
a3 b3 c3
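Finally, note that a2 b2 c2 was excluded because all of its columns match. How do I specify that I want to join on all columns, not just one? If df_a had a row a2 foo c2, I would want that row to be in result_1 as well.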
Use merge with the indicator parameter and an outer join first, then filter with query or boolean indexing:
df = df_a.merge(df_b, how='outer', indicator=True)
print (df)
col_a col_b col_c _merge
0 a1 b1 c1 left_only
1 a2 b2 c2 both
2 a3 b3 c3 right_only
a = df.query('_merge == "left_only"').drop(columns='_merge')
print (a)
col_a col_b col_c
0 a1 b1 c1
b = df.query('_merge == "right_only"').drop(columns='_merge')
print (b)
col_a col_b col_c
2 a3 b3 c3
Or:
a = df[df['_merge'] == "left_only"].drop(columns='_merge')
print (a)
col_a col_b col_c
0 a1 b1 c1
b = df[df['_merge'] == "right_only"].drop(columns='_merge')
print (b)
col_a col_b col_c
2 a3 b3 c3
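To address the last part of the question (joining on all columns and keeping a row such as a2 foo c2), here is a minimal sketch. When no on= argument is given, merge joins on every column the two frames share, so passing on=list(df_a.columns) below only makes that explicit. The extra a2 foo c2 row is taken from the question's hypothetical, and the names merged, result_1, and result_2 are just illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({
    "col_a": ["a1", "a2", "a2"],
    "col_b": ["b1", "b2", "foo"],
    "col_c": ["c1", "c2", "c2"],
})
df_b = pd.DataFrame({
    "col_a": ["a2", "a3"],
    "col_b": ["b2", "b3"],
    "col_c": ["c2", "c3"],
})

# Join on all shared columns; this matches the default when on= is omitted.
merged = df_a.merge(df_b, how="outer", on=list(df_a.columns), indicator=True)

# Rows found only in df_a (includes a2 foo c2, since col_b differs from df_b).
result_1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
# Rows found only in df_b.
result_2 = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

print(result_1)
print(result_2)
```

A row is tagged both only when every joined column matches, which is why a2 foo c2 ends up in result_1.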
Use pd.DataFrame.drop_duplicates
This assumes rows are unique within their respective DataFrames.
pd.concat([df_a, df_b]).drop_duplicates(keep=False)
col_a col_b col_c
0 a1 b1 c1
1 a3 b3 c3
You can even use pd.concat with the keys parameter to show which DataFrame each row came from.
pd.concat([df_a, df_b], keys=['a', 'b']).drop_duplicates(keep=False)
col_a col_b col_c
a 0 a1 b1 c1
b 1 a3 b3 c3
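The uniqueness assumption matters: with keep=False, a row repeated inside a single DataFrame is dropped even if the other DataFrame never contains it. The sketch below shows that failure and one possible workaround, de-duplicating each frame first. The sample data and the names naive and deduped are made up for illustration.

```python
import pandas as pd

# a1 b1 appears twice in df_a and never in df_b.
df_a = pd.DataFrame({"col_a": ["a1", "a1", "a2"], "col_b": ["b1", "b1", "b2"]})
df_b = pd.DataFrame({"col_a": ["a2", "a3"], "col_b": ["b2", "b3"]})

# Naive version: a1 b1 is lost because it duplicates itself within df_a.
naive = pd.concat([df_a, df_b]).drop_duplicates(keep=False)

# Workaround: make rows unique per frame before concatenating.
deduped = pd.concat(
    [df_a.drop_duplicates(), df_b.drop_duplicates()],
    keys=["a", "b"],
).drop_duplicates(keep=False)

print(naive)
print(deduped)
```

If the number of repeats within a frame needs to be preserved, the merge with indicator approach is probably the safer choice.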
concat and drop_duplicates with keep=False
new_df = pd.concat([df_a, df_b]).drop_duplicates(keep=False)
col_a col_b col_c
0 a1 b1 c1
1 a3 b3 c3
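Use numpy setdiff1d. Note that setdiff1d compares the flattened individual values, not whole rows, so this only works when differing rows share no values.
# compute both differences from the original frames before reassigning
only_a = np.setdiff1d(df_a.values, df_b.values)
only_b = np.setdiff1d(df_b.values, df_a.values)
df_a = pd.DataFrame(only_a.reshape(-1, df_a.shape[1]), columns=df_a.columns)
df_b = pd.DataFrame(only_b.reshape(-1, df_b.shape[1]), columns=df_b.columns)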
df_a
col_a col_b col_c
0 a1 b1 c1
df_b
col_a col_b col_c
0 a3 b3 c3
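Because np.setdiff1d works on flattened values, it breaks down on the a2 foo c2 case from the question: the leftover values no longer fill complete rows. The sketch below demonstrates that, then shows a row-wise set difference using pd.MultiIndex.from_frame and MultiIndex.difference. This is a swapped-in alternative, not the method from the answer above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"col_a": ["a1", "a2"], "col_b": ["b1", "foo"], "col_c": ["c1", "c2"]})
df_b = pd.DataFrame({"col_a": ["a2", "a3"], "col_b": ["b2", "b3"], "col_c": ["c2", "c3"]})

# Element-wise difference: a2 and c2 vanish because df_b contains those values,
# leaving four values that cannot be reshaped into rows of three columns.
leftover = np.setdiff1d(df_a.values, df_b.values)
print(leftover, leftover.size % df_a.shape[1])

# Row-wise difference: treat each row as one tuple.
rows_a = pd.MultiIndex.from_frame(df_a)
rows_b = pd.MultiIndex.from_frame(df_b)
only_a = rows_a.difference(rows_b).to_frame(index=False)
only_b = rows_b.difference(rows_a).to_frame(index=False)

print(only_a)
print(only_b)
```

Here a2 b2 c2 shows up in only_b as well, since df_a holds a2 foo c2 rather than an exact match.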