How does pandas deal with a column of type "object" being compared with an integer?

Question

My question is about the rules pandas uses when comparing a column of type "object" with an integer. Here is my code:

In [334]: df
Out[334]: 
     c1    c2        c3  c4
id1   1    li -0.367860   5
id2   2  zhao -0.596926   5
id3   3   sun  0.493806   5
id4   4  wang -0.311407   5
id5   5  wang  0.253646   5

In [335]: df < 2
Out[335]: 
        c1    c2    c3     c4
id1   True  True  True  False
id2  False  True  True  False
id3  False  True  True  False
id4  False  True  True  False
id5  False  True  True  False

In [336]: df.dtypes
Out[336]: 
c1      int64
c2     object
c3    float64
c4      int64
dtype: object

Why does the "c2" column come out as True everywhere?

P.S. I also tried:

In [333]: np.less(np.array(["s","b"]),2)
Out[333]: NotImplemented

Answer 1

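For a DataFrame, a comparison with a scalar always returns a DataFrame in which every column is boolean.

I don't think this is formally documented anywhere, but there is a comment in the source code (see below) confirming that it is the intended behaviour: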

[for] straight boolean comparisons [between a DataFrame and a scalar] we want to allow all columns (regardless of dtype to pass thru) See #4537 for discussion.

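In practice, this means that every comparison in every column must return either True or False. Any invalid comparison (e.g. 'li' < 2) has to default to one of these boolean values.

In short, the pandas developers decided that it should default to True.

There was some discussion of this behaviour in #4537, including arguments for using False instead, or for restricting the comparison to columns with compatible types, but the ticket was closed without any code being changed.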
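If you would rather have the "compatible columns only" behaviour suggested in that discussion, one option is to pick out the numeric columns yourself before comparing. The snippet below is only a sketch of that idea, not something pandas does for you; it rebuilds the frame from the question and uses `select_dtypes`. It also sidesteps the `TypeError` that more recent pandas releases may raise for a comparison like `'li' < 2`.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame(
    {
        "c1": [1, 2, 3, 4, 5],
        "c2": ["li", "zhao", "sun", "wang", "wang"],
        "c3": [-0.367860, -0.596926, 0.493806, -0.311407, 0.253646],
        "c4": [5, 5, 5, 5, 5],
    },
    index=["id1", "id2", "id3", "id4", "id5"],
)

# Keep only the numeric columns, so the string column "c2" is never compared.
numeric = df.select_dtypes(include="number")
mask = numeric < 2
print(mask)
```

The result has only the columns c1, c3 and c4, so there is no misleading all-True column for c2.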

If you are interested, you can see where the default value for invalid comparisons is applied in this internal method in ops.py:

def _comp_method_FRAME(cls, func, special):
    str_rep = _get_opstr(func, cls)
    op_name = _get_op_name(func, special)

    @Appender('Wrapper for comparison method {name}'.format(name=op_name))
    def f(self, other):
        if isinstance(other, ABCDataFrame):
            # Another DataFrame
            if not self._indexed_same(other):
                raise ValueError('Can only compare identically-labeled '
                                 'DataFrame objects')
            return self._compare_frame(other, func, str_rep)

        elif isinstance(other, ABCSeries):
            return _combine_series_frame(self, other, func,
                                         fill_value=None, axis=None,
                                         level=None, try_cast=False)
        else:

            # straight boolean comparisons we want to allow all columns
            # (regardless of dtype to pass thru) See #4537 for discussion.
            res = self._combine_const(other, func,
                                      errors='ignore',
                                      try_cast=False)
            return res.fillna(True).astype(bool)

    f.__name__ = op_name
    return f

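The else block is the one we are interested in for the scalar case.

Note the errors='ignore' argument, which means that an invalid comparison returns NaN (instead of raising an error). res.fillna(True) then fills these failed comparisons with True.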
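To make those two steps concrete, here is a rough emulation using only public pandas calls. It is not the pandas implementation: `compare_ignoring_errors` is a hypothetical helper written for this answer, and it assumes that a failed column comparison surfaces as a `TypeError`. Each column is compared on its own, a failing column becomes all NaN, and the NaNs are then filled with True.

```python
import operator

import numpy as np
import pandas as pd


def compare_ignoring_errors(df, scalar, op):
    """Hypothetical helper: mimic errors='ignore' followed by fillna(True)."""
    out = {}
    for name, col in df.items():
        try:
            out[name] = op(col, scalar)
        except TypeError:
            # Invalid comparison (e.g. str < int): record NaN instead of raising.
            out[name] = pd.Series(np.nan, index=df.index)
    res = pd.DataFrame(out, index=df.index)
    # Same final step as the quoted source: failed comparisons become True.
    return res.fillna(True).astype(bool)


df = pd.DataFrame(
    {
        "c1": [1, 2, 3, 4, 5],
        "c2": ["li", "zhao", "sun", "wang", "wang"],
        "c4": [5, 5, 5, 5, 5],
    },
    index=["id1", "id2", "id3", "id4", "id5"],
)
# Force the "object" dtype from the question, whatever the pandas version.
df["c2"] = df["c2"].astype(object)

result = compare_ignoring_errors(df, 2, operator.lt)
print(result)
```

The c2 column comes out as all True here purely because of the fill step, which is the effect seen in the question.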
