Expressions with "== True" and "is True" give different results

Question

I have the following MCVE:

#!/usr/bin/env python3                                           

import pandas as pd

df = pd.DataFrame([True, False, True])

print("Whole DataFrame:")
print(df)

print("\nFiltered DataFrame:")
print(df[df[0] == True])

The output is the following, which is what I expected:

Whole DataFrame:
    0
0   True
1  False
2   True

Filtered DataFrame:
      0
0  True
2  True

OK, but according to PEP8 the style is apparently wrong. The checker says: E712 comparison to True should be 'if cond is True:' or 'if cond:'. So I changed it to is True instead of == True, but now it fails. The output is:

Whole DataFrame:
    0
0   True
1  False
2   True

Filtered DataFrame:
0     True
1    False
2     True
Name: 0, dtype: bool

What is going on?

Answer 1

I think that in pandas the element-wise comparison only works with ==, and the result is a boolean Series. With is, the result is the single value False.

print(df[0] == True)
0     True
1    False
2     True
Name: 0, dtype: bool

print(df[df[0]])
      0
0  True
2  True

print(df[df[0] == True])
      0
0  True
2  True

print(df[0] is True)
False

print(df[df[0] is True])
0     True
1    False
2     True
Name: 0, dtype: bool

Answer 2

In Python, is tests whether an object is the very same object as another. == is defined by pandas.Series to act element-wise; is is not.

Because of that, df[0] is True checks whether df[0] and True are the same object. The result is False, which in turn is equal to 0, so you get the 0 column when doing df[df[0] is True].
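To make the difference concrete, here is a minimal sketch using the DataFrame from the question. It deliberately avoids indexing with False, since that part depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame([True, False, True])

# == is overloaded by pandas.Series and compares element by element.
mask = df[0] == True  # noqa: E712

# `is` cannot be overloaded: it only checks object identity.
identity = df[0] is True

print(type(mask).__name__)  # Series
print(identity)             # False

# False and 0 compare equal and hash alike, which is why False
# can end up standing in for the column label 0.
print(False == 0, hash(False) == hash(0))  # True True
```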

Answer 3

The catch here is that in df[df[0] == True], you are not comparing objects to True.

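As the other answers say, == is overloaded in pandas to produce a Series instead of the usual bool. [] is overloaded too, to interpret a Series and give the filtered result. The code is essentially equivalent to: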

series = df[0].__eq__(True)
df.__getitem__(series)

So, you are not violating PEP8 by leaving == here.

Essentially, pandas gives familiar syntax unusual semantics, and that is what caused the confusion.
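As a rough illustration of that idea, here is a toy container that overloads == and [] in the same spirit. The Column class is hypothetical and is not how pandas is implemented; it only shows why == can return a mask while is never can.

```python
class Column:
    """Hypothetical toy container, not pandas' actual implementation."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Element-wise comparison: returns a list of bools (a mask),
        # not a single bool.
        return [v == other for v in self.values]

    def __getitem__(self, mask):
        # Keep only the elements whose mask entry is truthy.
        return Column(v for v, keep in zip(self.values, mask) if keep)


col = Column([True, False, True])

mask = col == True       # calls Column.__eq__  # noqa: E712
filtered = col[mask]     # calls Column.__getitem__
identity = col is True   # identity check, cannot be overloaded

print(mask)             # [True, False, True]
print(filtered.values)  # [True, True]
print(identity)         # False
```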

According to Stroustrup (sec. 3.3.3), operator overloading has been causing trouble for this reason ever since its invention (and he had to think hard about whether to include it in C++). Seeing even more abuse of it in C++, Gosling ran to the other extreme in Java and banned it completely, and that proved to be exactly that, an extreme.

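As a result, modern languages and code tend to have operator overloading, but watch closely that it is not overused and that the semantics stay consistent.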

Answer 4

This is an elaboration on MaxNoe's answer, since it was too lengthy to include in the comments.

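As he pointed out, df[0] is True evaluates to False, which is then coerced to 0, which corresponds to a column name. What is interesting is what happens if you run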

>>> df = pd.DataFrame([True, False, True])
>>> df[False]
KeyError                                  Traceback (most recent call last)
<ipython-input-21-62b48754461f> in <module>()
----> 1 df[False]

>>> df[0]
0     True
1    False
2     True
Name: 0, dtype: bool
>>> df[False]
0     True
1    False
2     True
Name: 0, dtype: bool

At first (at least to me) this seems a bit confusing, but it has to do with how pandas makes use of caching. If you look at how df[False] is resolved, it looks like

  /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1975)__getitem__()
-> return self._getitem_column(key)
  /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1999)_getitem_column()
-> return self._get_item_cache(key)
> /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/generic.py(1343)_get_item_cache()
-> res = cache.get(item)

Since cache is just a regular Python dict, after running df[0] the cache looks like

>>> cache
{0: 0     True
1    False
2     True
Name: 0, dtype: bool}

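So when we look up False, Python treats it as the key 0 (False == 0, and the two hash identically). If we have not already primed the cache using df[0], then res is None, which triggers a KeyError on line 1345 of generic.py: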

def _get_item_cache(self, item):
1341            """Return the cached item, item represents a label indexer."""
1342            cache = self._item_cache
1343 ->         res = cache.get(item)
1344            if res is None:
1345                values = self._data.get(item)
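The dict behaviour behind this can be reproduced without pandas. The sketch below uses a plain dict and a placeholder string in place of the cached column, so the names here are only illustrative.

```python
cache = {}

# Nothing cached yet: looking up False finds nothing.
miss = cache.get(False)

# Simulate what df[0] does: store an entry under the key 0.
cache[0] = "column 0"

# False == 0 and hash(False) == hash(0), so False now finds that entry.
hit = cache.get(False)

print(miss)  # None
print(hit)   # column 0
```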

Answer 5

A workaround that gets no complaints from linters but still has reasonable subsetting syntax could be:

import pandas as pd

s = pd.Series([True] * 10 + [False])

s.loc[s == True]  # bad comparison in Python's eyes
s.loc[s.isin([True])]  # valid comparison, not as ugly as s.__eq__(True)

Both also take the same amount of time.

Also, for DataFrames, one can use query:

df = pd.DataFrame([
        [True] * 10 + [False],
        list(range(11))],
    index=['T', 'N']).T
df.query("T == True")  # also okay
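As a quick sanity check, the sketch below compares the filtering spellings mentioned in this thread on a small boolean Series. It assumes the Series really has bool dtype; with missing values or object dtype the spellings may not be interchangeable.

```python
import pandas as pd

s = pd.Series([True, False, True])

by_eq = s[s == True]        # flagged by E712  # noqa: E712
by_mask = s[s]              # the boolean Series used directly as the mask
by_isin = s[s.isin([True])]

# All three select the same rows (index labels 0 and 2).
print(by_eq.equals(by_mask))    # True
print(by_eq.equals(by_isin))    # True
print(by_mask.index.tolist())   # [0, 2]
```

When the column is already boolean, using it directly as the mask is the spelling that the E712 message itself points towards ("if cond:").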


python

pep8

pandas