Python comparison ignoring nan
While nan == nan is always False, in many cases people want to treat them as equal, and this is enshrined in pandas.DataFrame.equals:
NaNs in the same location are considered equal.
Of course, I can write
import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))
However, this will fail on containers like [float("nan")], and isnan barfs on non-numbers (so the complexity increases).
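To make the failure concrete, here is a small check of both problems, assuming the equalp helper from above. Two lists holding distinct nan objects compare unequal, and math.isnan raises TypeError for anything that is not a number; raises_type_error is a hypothetical helper added only for this demonstration.

```python
import math

def equalp(x, y):
    return (x == y) or (math.isnan(x) and math.isnan(y))

def raises_type_error(x, y):
    """Return True if equalp(x, y) raises TypeError (hypothetical helper)."""
    try:
        equalp(x, y)
    except TypeError:
        return True
    return False

# Scalars work as intended.
scalar_ok = equalp(float("nan"), float("nan"))
# The lists compare unequal, then math.isnan(list) raises TypeError.
lists_fail = raises_type_error([float("nan")], [float("nan")])
# Unequal strings reach math.isnan(str), which also raises TypeError.
strings_fail = raises_type_error("a", "b")
print(scalar_ok, lists_fail, strings_fail)
```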
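So, what do people do to compare complex Python objects which may contain nan?
PS. Motivation: when comparing two rows in a pandas DataFrame, I would convert them into dicts and compare the dicts element-wise.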
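I assume you have array data, or data that can at least be converted to a numpy array?
One way is to mask all the nans using a numpy.ma array and then compare the arrays. So your starting situation is something like this: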
import numpy as np
import numpy.ma as ma
arr1 = ma.array([3, 4, 6, np.nan, 2])
arr2 = ma.array([3, 4, 6, np.nan, 2])
print(arr1 == arr2)
print(ma.all(arr1 == arr2))
>>> [ True True True False True]
>>> False # <-- you want this to show True
Solution:
arr1[np.isnan(arr1)] = ma.masked
arr2[np.isnan(arr2)] = ma.masked
print(arr1 == arr2)
print(ma.all(arr1 == arr2))
>>> [True True True -- True]
>>> True
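One thing to watch with masking: as far as I can tell, a position that is masked in only one of the two arrays is also left out of ma.all, so an array with a nan where the other has a number could still be reported as equal. Below is a minimal sketch that requires the nans to sit in the same positions, using plain numpy. nan_equal is a hypothetical helper name, not a numpy function.

```python
import numpy as np

def nan_equal(a, b):
    """Equality where nan matches only a nan in the same position (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # Positions where both arrays hold nan count as equal.
    both_nan = np.isnan(a) & np.isnan(b)
    return bool(np.all((a == b) | both_nan))

same = nan_equal([3, 4, 6, np.nan, 2], [3, 4, 6, np.nan, 2])
different = nan_equal([3, 4, 6, np.nan, 2], [3, 4, 6, 5, 2])
print(same, different)
```

NumPy 1.19 and later also accept np.array_equal(a, b, equal_nan=True), which should give the same answer for float arrays.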
Suppose you have a dataframe with nan values:
In [10]: df = pd.DataFrame(np.random.randint(0, 20, (10, 10)).astype(float), columns=["c%d"%d for d in range(10)])
In [10]: df.where(np.random.randint(0,2, df.shape).astype(bool), np.nan, inplace=True)
In [10]: df
Out[10]:
     c0    c1    c2    c3    c4    c5    c6    c7   c8    c9
0   NaN   6.0  14.0   NaN   5.0   NaN   2.0  12.0  3.0   7.0
1   NaN   6.0   5.0  17.0   NaN   NaN  13.0   NaN  NaN   NaN
2   NaN  17.0   NaN   8.0   6.0   NaN   NaN  13.0  NaN   NaN
3   3.0   NaN   NaN  15.0   NaN   8.0   3.0   NaN  3.0   NaN
4   7.0   8.0   7.0   NaN   9.0  19.0   NaN   0.0  NaN  11.0
5   NaN   NaN  14.0   2.0   NaN   NaN   0.0   NaN  NaN   8.0
6   3.0  13.0   NaN   NaN   NaN   NaN   NaN  12.0  3.0   NaN
7  13.0  14.0   NaN   5.0  13.0   NaN  18.0   6.0  NaN   5.0
8   3.0   9.0  14.0  19.0  11.0   NaN   NaN   NaN  NaN   5.0
9   3.0  17.0   NaN   NaN   0.0   NaN  11.0   NaN  NaN   0.0
You want to compare rows, say rows 0 and 8. Then just use fillna and do a vectorized comparison:
In [12]: df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)
Out[12]:
c0     True
c1     True
c2    False
c3     True
c4     True
c5    False
c6     True
c7     True
c8     True
c9     True
dtype: bool
If you just want to know which columns differ, you can use the resulting boolean array to index the columns:
In [14]: df.columns[df.iloc[0,:].fillna(0) != df.iloc[8,:].fillna(0)]
Out[14]: Index(['c0', 'c1', 'c3', 'c4', 'c6', 'c7', 'c8', 'c9'], dtype='object')
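Note that fillna(0) makes a nan compare equal to a genuine 0, which may or may not be what you want. Here is a sketch of a variant that avoids picking a fill value, assuming both rows share the same index. rows_differ is a hypothetical helper name, and the two short rows are made-up illustration data.

```python
import numpy as np
import pandas as pd

def rows_differ(a: pd.Series, b: pd.Series) -> pd.Series:
    """True where a and b differ, treating nan vs nan as equal (hypothetical helper)."""
    both_nan = a.isna() & b.isna()
    # a.ne(b) is True for nan vs nan, so cancel those positions explicitly.
    return a.ne(b) & ~both_nan

# Illustration data: c1 pairs a real 0.0 with nan, c3 pairs nan with nan.
row_a = pd.Series([np.nan, 0.0, 14.0, np.nan], index=["c0", "c1", "c2", "c3"])
row_b = pd.Series([3.0, np.nan, 14.0, np.nan], index=["c0", "c1", "c2", "c3"])

diff = rows_differ(row_a, row_b)
print(list(diff[diff].index))
```

With this variant c1 is reported as different, whereas the fillna(0) comparison would treat 0.0 and nan as the same value.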
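Here is a function that recurses into a data structure, replacing nan values with a unique string. I wrote this for a unit test that compares data structures which may contain nan.
It is only designed for data structures made of dict and list, but it is easy to see how to extend it.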
from math import isnan
from uuid import uuid4
from typing import Union

# Unique sentinel string; unlike nan, it compares equal to itself.
NAN_REPLACEMENT = f"THIS_WAS_A_NAN{uuid4()}"


def replace_nans(data_structure: Union[dict, list]) -> Union[dict, list]:
    # Replaces nan values in place and returns the same object.
    if isinstance(data_structure, dict):
        iterme = data_structure.items()
    elif isinstance(data_structure, list):
        iterme = enumerate(data_structure)
    else:
        raise ValueError(
            "replace_nans should only be called on structures made of dicts and lists"
        )
    for key, value in iterme:
        if isinstance(value, float) and isnan(value):
            data_structure[key] = NAN_REPLACEMENT
        elif isinstance(value, (dict, list)):
            # Recurse into nested containers.
            data_structure[key] = replace_nans(data_structure[key])
    return data_structure
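Since replace_nans rewrites its argument in place, a test that wants to keep its inputs untouched could compare recursively instead. Below is a minimal sketch along the same lines, again limited to dicts, lists, and scalars. nan_aware_equal is a hypothetical name, not part of the answer above.

```python
from math import isnan

def nan_aware_equal(x, y):
    """Recursively compare dicts/lists/scalars, treating nan as equal to nan."""
    if isinstance(x, dict) and isinstance(y, dict):
        # Same key set, and every value pair must match.
        return x.keys() == y.keys() and all(
            nan_aware_equal(x[k], y[k]) for k in x
        )
    if isinstance(x, list) and isinstance(y, list):
        return len(x) == len(y) and all(
            nan_aware_equal(a, b) for a, b in zip(x, y)
        )
    if isinstance(x, float) and isinstance(y, float) and isnan(x) and isnan(y):
        return True
    # Everything else falls back to ordinary equality.
    return x == y

left = {"a": [1.0, float("nan")], "b": {"c": float("nan")}}
right = {"a": [1.0, float("nan")], "b": {"c": float("nan")}}
other = {"a": [1.0, 2.0], "b": {"c": float("nan")}}
print(nan_aware_equal(left, right), nan_aware_equal(left, other))
```

Because nothing is replaced, the inputs can still be inspected or reused after the comparison.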