Apply dictionary look-up function to compare pandas dataframe columns

Question

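I want to apply a function to two columns (A and B) of a pandas dataframe to test whether each pair of values maps to the same result in a dictionary. I want it to return the result in a third column.

I have tried the code below, and close variants of it, but I keep getting errors, and I think there is something fundamental about the data structures that I don't understand. Can anyone explain where I am going wrong? I can imagine tedious alternative ways of doing this, but I am sure there must be an elegant solution.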

def do_they_match(A1,A2):
    if A1 in dictionary and A2 in dictionary and dictionary[A1] == dictionary[A2]:
        return 1
    else:
        return 0

df['match'] = df.apply(lambda x: do_they_match(x['A'],x['B']))
## also tried ## 
df = df.assign(link=lambda x: do_they_match(x['A'],x['B']))

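For context, the errors I get are IndexError: ('A', 'occurred at index A'), or, for the last line, TypeError: 'Series' objects are mutable, thus they cannot be hashed. The values in the dataframe columns and the alternative codes in the dictionary are all strings.

Thanks for your help!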

Answer 1

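You are getting the error because you want your function to receive its arguments from the same row, but with your syntax the x in lambda x refers to a column. The code x['A'] is therefore trying to look up A as a row index of the column x that is currently being processed. Each column of the dataframe takes its turn as the column processed by this apply statement.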

You have to use the axis= parameter of .apply() to instruct pandas to operate row by row, by passing axis=1.

The official documentation explains the axis parameter:

axis {0 or ‘index’, 1 or ‘columns’}, default 0 Axis along which the function is applied:

0 or ‘index’: apply function to each column.

1 or ‘columns’: apply function to each row.

The default is axis=0, which applies the function to each column.
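To make the difference concrete, here is a minimal sketch showing what the lambda receives in each case. The two-row dataframe is made up for illustration; x.name is the label of the Series that pandas passes in.

```python
import pandas as pd

# Hypothetical two-row dataframe, for illustration only.
df = pd.DataFrame({'A': ['x1', 'x2'], 'B': ['y1', 'y2']})

# axis=0 (default): the function is called once per column,
# so x.name is a column label and x is indexed by row labels.
seen_by_default = df.apply(lambda x: x.name)

# axis=1: the function is called once per row,
# so x.name is a row label and x is indexed by column labels.
seen_by_row = df.apply(lambda x: x.name, axis=1)

print(seen_by_default.tolist())  # column labels
print(seen_by_row.tolist())      # row labels
```

With the default axis, x['A'] is a look-up among row labels, which is why it fails on a dataframe with a default integer index.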

To fix the error, you can add axis=1 to your code:

df['match'] = df.apply(lambda x: do_they_match(x['A'], x['B']), axis=1)
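As a quick check, the corrected line can be run end to end. This is only a sketch: the sample dataframe and dictionary here are assumptions made up for illustration, and the function is the one from the question with lower-case argument names.

```python
import pandas as pd

# Sample data assumed for illustration.
dictionary = {'x1': 'apple', 'y1': 'orange', 'x2': 'banana', 'y2': 'banana'}
df = pd.DataFrame({'A': ['x1', 'x2'], 'B': ['y1', 'y2']})

def do_they_match(a1, a2):
    # Return 1 only when both keys exist and map to the same value.
    if a1 in dictionary and a2 in dictionary and dictionary[a1] == dictionary[a2]:
        return 1
    return 0

# axis=1 passes each row to the lambda, so x['A'] and x['B'] are scalars.
df['match'] = df.apply(lambda x: do_they_match(x['A'], x['B']), axis=1)
print(df)
```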

A better solution: you don't need to define a custom function at all. You can map the dictionary values with the pandas .map() method, as follows:

df['match'] = (df['A'].map(dictionary) == df['B'].map(dictionary)).astype(int)

We use astype(int) to convert the boolean result to the integers 0 (for False) and 1 (for True).

For example, suppose we have the following dataframe and dictionary:

df = pd.DataFrame({'A': ['x1', 'x2', 'x3'], 'B': ['y1', 'y2', 'y3']})

    A   B
0  x1  y1
1  x2  y2
2  x3  y3


dictionary = {'x1': 'apple', 'y1': 'orange', 'x2': 'banana', 'y2': 'banana', 'x3': 'peach'}

When we apply the code, we get:

df['match'] = (df['A'].map(dictionary) == df['B'].map(dictionary)).astype(int)

print(df)

    A   B  match
0  x1  y1      0
1  x2  y2      1
2  x3  y3      0
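One detail worth checking is what happens to keys that are missing from the dictionary. The sketch below assumes an extra row ('x4', 'y4') that is not in the original example, and assumes the dictionary values are plain strings as stated in the question. .map() yields NaN for missing keys, and NaN never compares equal to anything (including another NaN), so such rows still get 0, as they do with the function from the question.

```python
import pandas as pd

dictionary = {'x1': 'apple', 'y1': 'orange', 'x2': 'banana',
              'y2': 'banana', 'x3': 'peach'}
# The last row is an added, hypothetical case: neither key is in the dictionary.
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['y1', 'y2', 'y3', 'y4']})

mapped_a = df['A'].map(dictionary)  # 'x4' becomes NaN
mapped_b = df['B'].map(dictionary)  # 'y3' and 'y4' become NaN

# Vectorized comparison: NaN == NaN is False, so missing keys give 0.
via_map = (mapped_a == mapped_b).astype(int)

# Row-wise equivalent of the question's function, for comparison.
via_apply = df.apply(
    lambda row: int(row['A'] in dictionary
                    and row['B'] in dictionary
                    and dictionary[row['A']] == dictionary[row['B']]),
    axis=1,
)

print(via_map.tolist())
print(via_apply.tolist())
```

On this sample the two approaches agree. The .map() version avoids calling a Python function once per row, which usually matters on larger dataframes.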
