How to specify hierarchical columns in Pandas merge?

Question

After badly misunderstanding how the on parameter works in join (spoiler: very differently from on in merge), I ended up with the following example code.

import pandas as pd

index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)

index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)

print(df1.merge(df2, on="fruit", how="left"))

I get a KeyError. How do I correctly reference variables.fruit here?

To see what I am after, consider the same problem without a MultiIndex:

import pandas as pd

df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])

# this is obviously incorrect as it uses indexes on `df1` as well as `df2`:
print(df1.join(df2, rsuffix="_"))

# this is *also* incorrect, although I initially thought it should work, but it uses the index on `df2`:
print(df1.join(df2, on="fruit", rsuffix="_"))

# this is correct:
print(df1.merge(df2, on="fruit", how="left"))

The expected and desired result is this:

  number   fruit   color
0    one   apple     NaN
1    two  banana  yellow

How do I get the same result when fruit is part of a MultiIndex?

Answer 1

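I think I understand what you are trying to accomplish now, and I don't think join will get you there. DataFrame.join and DataFrame.merge both call pandas.core.reshape.merge.merge, but DataFrame.merge gives you more control over which defaults are applied.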

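In your case, you can reference the column to join on with a list of tuples, where the elements of each tuple are the levels of the multi-indexed column. That is, to use the variables / fruit column, you can pass [('variables', 'fruit')].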

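Tuples are how you index into multi-index columns (and row indices). You need to wrap the tuple in a list because a merge can be performed on multiple columns, or multiple multi-indexed columns, like a JOIN statement in SQL. Passing a single string is just a convenience case: it gets wrapped in a list for you.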
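To make the tuple addressing concrete, here is a minimal sketch using the same df1 as in the question. It only inspects the column labels and does not merge anything; the variable name fruit is just for illustration.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=columns)

# Each column label is a tuple with one element per level.
print(list(df1.columns))

# A bare "fruit" is not a column label here, but the full tuple is.
print("fruit" in df1.columns)
print(("variables", "fruit") in df1.columns)

# Indexing with the full tuple returns that single column as a Series.
fruit = df1[("variables", "fruit")]
print(fruit.tolist())
```

This is why on="fruit" cannot find the key, while a list holding the full tuple can.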

Since you are joining on only one column, it is a list containing a single tuple.

import pandas as pd

index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)

index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)

df1.merge(df2, how='left', on=[('variables', 'fruit')])
# returns:
  variables
     number   fruit   color
0       one   apple     NaN
1       two  banana  yellow
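If you want to compare this against the flat result shown in the question, one option is to drop the outer column level after merging. This is only a sketch: it assumes the outer level ("variables") carries no information you need to keep, and the names merged and flat are illustrative.

```python
import pandas as pd

index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)

index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)

# Merge on the hierarchical column, addressed by its full tuple.
merged = df1.merge(df2, how="left", on=[("variables", "fruit")])

# Drop the outer "variables" level so only the inner labels remain.
flat = merged.droplevel(0, axis=1)
print(flat)
```

The flattened frame has the columns number, fruit and color, with a missing color for the apple row, which is the shape the question asked for.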
