Why does .loc not always match column names?

Question

I ran into this today and found it a bit confusing, so I wanted to ask about it.

Suppose we have two DataFrames:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
    A   B   C
0   3   1   6
1   2   4   0
2   8   8   0
3   8   6   7
4   4   5   0

df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))

    C   B   A
0   3   5   5
1   7   4   6
2   0   7   7
3   6   6   5
4   4   0   6

If we want to conditionally assign new values in the first df, we can do this:

df.loc[df['A'].gt(3)] = df2

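I expected the columns to align, with any column missing from df2 filled with NaN in the first df. However, when the code above runs, it replaces the data without taking the column names into account (although it does take the index labels into account).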

    A   B   C
0   3   1   6
1   2   4   0
2   0   7   7
3   6   6   5
4   4   0   6

At index 2 we get [0,7,7] instead of [7,7,0].

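However, if we pass the column names into the loc statement, without changing the order of the columns in df2, the assignment does align on the columns.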

df.loc[df['A'].gt(3),['A','B','C']] = df2
    A   B   C
0   3   1   6
1   2   4   0
2   7   7   0
3   5   6   6
4   6   0   4

Why does this happen?

Answer 1

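Interestingly, loc performs a number of optimizations to improve performance, and one of them is checking the type of the key that is passed in.

Both row and column indexes included

When both a row index and a column index are passed, the __setitem__ function: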

def __setitem__(self, key, value):
    if isinstance(key, tuple):
        key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    else:
        key = com.apply_if_callable(key, self.obj)
    indexer = self._get_setitem_indexer(key)
    self._has_valid_setitem_indexer(key)

    iloc = self if self.name == "iloc" else self.obj.iloc
    iloc._setitem_with_indexer(indexer, value, self.name)

interprets the key as a tuple.

key:

(0    False
1    False
2     True
3     True
4     True
Name: A, dtype: bool, 
['A', 'B', 'C'])
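
As a side note, the tuple is not something pandas assembles: Python itself packs comma-separated subscripts into one tuple before calling __setitem__. The sketch below uses a hypothetical KeyRecorder class (my own name, not part of pandas) purely to show what the key looks like with one subscript versus two.

```python
class KeyRecorder:
    """Hypothetical helper (not part of pandas): stores the last key given to __setitem__."""

    def __init__(self):
        self.last_key = None

    def __setitem__(self, key, value):
        # Just remember the key exactly as Python handed it over.
        self.last_key = key


rec = KeyRecorder()

rec["rows", ["A", "B", "C"]] = 0  # two subscripts: the key arrives as a 2-tuple
two_part_key = rec.last_key

rec["rows"] = 0                   # one subscript: the key arrives unchanged
one_part_key = rec.last_key

print(type(two_part_key).__name__, type(one_part_key).__name__)
```

So whether loc sees a tuple depends only on whether a second subscript (even a bare :) was written.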

The key is then passed to _get_setitem_indexer, which converts it from a label-based indexer into a positional indexer:

indexer = self._get_setitem_indexer(key)

def _get_setitem_indexer(self, key):
    """
    Convert a potentially-label-based key into a positional indexer.
    """
    if self.name == "loc":
        self._ensure_listlike_indexer(key)

    if self.axis is not None:
        return self._convert_tuple(key, is_setter=True)

    ax = self.obj._get_axis(0)

    if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
        with suppress(TypeError, KeyError, InvalidIndexError):
            # TypeError e.g. passed a bool
            return ax.get_loc(key)

    if isinstance(key, tuple):
        with suppress(IndexingError):
            return self._convert_tuple(key, is_setter=True)

    if isinstance(key, range):
        return list(key)

    try:
        return self._convert_to_indexer(key, axis=0, is_setter=True)
    except TypeError as e:

        # invalid indexer type vs 'other' indexing errors
        if "cannot do" in str(e):
            raise
        elif "unhashable type" in str(e):
            raise
        raise IndexingError(key) from e

This produces a tuple indexer (both the rows and the columns are converted):

if isinstance(key, tuple):
    with suppress(IndexingError):
        return self._convert_tuple(key, is_setter=True)

returns

(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
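
A rough way to reproduce that pair of arrays using only public API, assuming the df from the question. This is not the code path pandas takes internally, just an equivalent conversion: np.flatnonzero for the boolean row mask and Index.get_indexer for the column labels.

```python
import numpy as np
import pandas as pd

# The first frame from the question, written out literally.
df = pd.DataFrame(
    [[3, 1, 6], [2, 4, 0], [8, 8, 0], [8, 6, 7], [4, 5, 0]],
    columns=list("ABC"),
)

mask = df["A"].gt(3)    # boolean row key
cols = ["A", "B", "C"]  # label-based column key

# Rows: positions where the mask is True.
row_positions = np.flatnonzero(mask.to_numpy())
# Columns: position of each label in df.columns.
col_positions = df.columns.get_indexer(cols)

print((row_positions, col_positions))
```

The exact integer dtype shown when printing may differ by platform.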

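Row index only

However, when only a row index is passed to loc, the indexer is not a tuple, so only one dimension is converted from labels to positions: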

return self._convert_to_indexer(key, axis=0, is_setter=True)

returns

[2 3 4]

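For this reason, when only a single value is passed to loc, no alignment happens across the columns, because the column axis is never resolved.

This is why an empty slice is often used: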
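
To make the two outcomes concrete, here is a small sketch with the frames from the question. It does not call the loc setter at all; it only builds the values that a purely positional fill and a column-aligned fill would write into the masked rows, so the difference is visible side by side.

```python
import pandas as pd

# Frames from the question, written out literally.
df = pd.DataFrame(
    [[3, 1, 6], [2, 4, 0], [8, 8, 0], [8, 6, 7], [4, 5, 0]],
    columns=list("ABC"),
)
df2 = pd.DataFrame(
    [[3, 5, 5], [7, 4, 6], [0, 7, 7], [6, 6, 5], [4, 0, 6]],
    columns=list("CBA"),
)
mask = df["A"].gt(3)

# Positional: take df2's rows as they are stored (column order C, B, A).
positional = df2.loc[mask].to_numpy()

# Aligned: select df2's columns by label in df's order (A, B, C).
aligned = df2.loc[mask, df.columns].to_numpy()

print(positional)
print(aligned)
```

The first array matches the rows of the unexpected result in the question and the second matches the result obtained with the explicit column list.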

df.loc[df['A'].gt(3), :] = df2

because it is enough to align the columns properly.
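
If you would rather not depend on which indexer path loc takes, one option (my own suggestion, not something the pandas source above requires) is to put df2's columns into df's order yourself with reindex before assigning. A minimal sketch with the frames from the question, shown next to the empty-slice form:

```python
import pandas as pd

# Frames from the question, written out literally.
df = pd.DataFrame(
    [[3, 1, 6], [2, 4, 0], [8, 8, 0], [8, 6, 7], [4, 5, 0]],
    columns=list("ABC"),
)
df2 = pd.DataFrame(
    [[3, 5, 5], [7, 4, 6], [0, 7, 7], [6, 6, 5], [4, 0, 6]],
    columns=list("CBA"),
)
mask = df["A"].gt(3)

# Option 1: empty slice, so loc receives a (rows, columns) key.
via_slice = df.copy()
via_slice.loc[mask, :] = df2

# Option 2: reorder df2's columns to match df, then assign by rows only.
via_reindex = df.copy()
via_reindex.loc[mask] = df2.reindex(columns=df.columns)

print(via_slice)
print(via_reindex)
```

With the value already in df's column order, the row-only assignment gives the same frame whether or not the setter aligns columns.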

Example:

import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)

df.loc[df['A'].gt(3), :] = df2
print(df)

df after df.loc[df['A'].gt(3), :] = df2:

   A  B  C
0  3  6  6
1  0  8  4
2  0  5  0
3  4  4  0  # Aligned as expected
4  4  2  3
