Identify equality in columns while ignoring NaNs

How can I compare two pandas columns for equality while ignoring empty/NaN values?

So the following should return True when column 2 is the same as column 1, even when column 2 contains NaN:

df['col1'].equals(df['col2'])

Use an intermediate column (Series) that is the same as col2, but with its NaN values set to the values from col1.

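import pandas as pd
df = pd.DataFrame({'col1': [1., 2, 3, 4, 5, 6], 'col2': [1, 2, None, 4, 5, None]})
df['col1'].equals(df['col2'])          # False: col2 has NaN where col1 has 3 and 6
s = df['col2'].fillna(df['col1'])      # copy of col2 with NaNs replaced by col1 values
df['col1'].equals(s)                   # True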

You can use boolean filtering (this treats NA/NaN symmetrically in both columns):

mask = df['col1'].notna() & df['col2'].notna()
df.loc[mask, 'col1'].equals(df.loc[mask, 'col2'])
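To make the symmetry concrete, here is a minimal sketch with made-up data in which each column has a NaN in a different row. The values are purely illustrative and not from the question; the point is only that swapping the two columns gives the same result.

```python
import pandas as pd

# Hypothetical example data: NaNs appear in both columns, in different rows.
df = pd.DataFrame({
    'col1': [1.0, None, 3.0, 4.0],
    'col2': [1.0, 2.0, None, 4.0],
})

# Keep only the rows where both columns hold a value.
mask = df['col1'].notna() & df['col2'].notna()

# Compare the surviving rows in both directions.
left_vs_right = df.loc[mask, 'col1'].equals(df.loc[mask, 'col2'])
right_vs_left = df.loc[mask, 'col2'].equals(df.loc[mask, 'col1'])
print(left_vs_right, right_vs_left)  # True True
```

Only rows 0 and 3 survive the mask, so the comparison never sees a NaN on either side.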

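I had to dig into this a bit, because some of the answers won't work if you don't know which columns contain the missing values. Let's also see which answer is the fastest.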

So let's create some test data:

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(10_000)) # Generate random column
ser2 = ser1.copy(deep=True)              # Exact copy of values
ser3 = ser1.copy(deep=True)              # Exact copy of values
ser4 = pd.Series(np.random.rand(10_000)) # Different data

# Create independent nans
# ser1 without nans
ser2[np.random.rand(10_000) > 0.8] = np.nan 
ser3[np.random.rand(10_000) > 0.8] = np.nan
ser4[np.random.rand(10_000) > 0.8] = np.nan

The current answers, written as functions that take pd.Series (the column type):

# Sanity check: plain `Series.equals`, which does not treat NaN as equal to an arbitrary value
def equality_pandas(a, b):
    return a.equals(b)

def equality_filling_one_sided(a, b):
    return a.equals(b.fillna(a))

def equality_filling_dummy_data(a, b):
    return a.fillna(-9999).equals(b.fillna(a).fillna(-9999))

def equality_boolean_mask(a, b):
    mask = a.notna() & b.notna()
    return a[mask].equals(b[mask])

def equality_pure_boolean(a, b):
    # using binary or operator to make it True if isna
    return ((a == b) | a.isna() | b.isna()).all()

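Let's define some tests for what we expect from a general comparison function that ignores NaNs and doesn't care whether those NaNs are in the left or the right column: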

def tests(equal):
    assert equal(ser1, ser1), "Identity has to be true without nan"
    assert equal(ser2, ser2), "Identity with nans at the same position"
    assert equal(ser1, ser2), "Same data, NaNs only on the right"
    assert equal(ser2, ser1), "Same data, NaNs only on the left"
    assert equal(ser2, ser3), "Same data but different NaNs"
    assert not equal(ser1, ser4), "Different data has to be not equal (NaNs only right)"
    assert not equal(ser2, ser4), "Different data has to be not equal (NaNs in both)"
    print("PASS")

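Running these tests shows that filling in only one direction does not make the comparison commutative (not even with dummy values, which is probably bad practice in general). Note that Series.equals returns True if the NaNs are in exactly the same positions, contrary to the rule that np.nan == np.nan is False!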

>>> tests(equality_pandas)

          2     assert equal(ser1, ser1), "Identity has to be true without nan"
          3     assert equal(ser2, ser2), "Identity with nans at the same position"
    ----> 4     assert equal(ser1, ser2), "Same data, NaNs only on the right"
          5     assert equal(ser2, ser1), "Same data, NaNs only on the left"
          6     assert equal(ser2, ser3), "Same data but different NaNs"


    AssertionError: Same data, NaNs only on the right


>>> tests(equality_filling_one_sided)

          3     assert equal(ser2, ser2), "Identity with nans at the same position"
          4     assert equal(ser1, ser2), "Same data, NaNs only on the right"
    ----> 5     assert equal(ser2, ser1), "Same data, NaNs only on the left"
          6     assert equal(ser2, ser3), "Same data but different NaNs"
          7     assert not equal(ser1, ser4), "Different data has to be not equal"


    AssertionError: Same data, NaNs only on the left


>>> tests(equality_filling_dummy_data)


          3     assert equal(ser2, ser2), "Identity with nans at the same position"
          4     assert equal(ser1, ser2), "Same data, NaNs only on the right"
    ----> 5     assert equal(ser2, ser1), "Same data, NaNs only on the left"
          6     assert equal(ser2, ser3), "Same data but different NaNs"
          7     assert not equal(ser1, ser4), "Different data has to be not equal"


    AssertionError: Same data, NaNs only on the left


>>> tests(equality_boolean_mask)
PASS

>>> tests(equality_pure_boolean)
PASS
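The two NaN effects behind these results can also be checked in isolation. This is a small sketch with made-up three-element series (not the test data above): one part contrasts element-wise `==` with `Series.equals`, the other shows why one-sided filling depends on argument order.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
t = s.copy()

# Element-wise comparison: NaN is never equal to NaN.
elementwise = (s == t)
print(elementwise.tolist())   # [True, False, True]

# Series.equals treats NaNs in the same position as equal.
same_position = s.equals(t)
print(same_position)          # True

# One-sided filling only repairs NaNs in the argument that gets filled.
full = pd.Series([1.0, 2.0, 3.0])
filled_right = full.equals(s.fillna(full))  # NaN in s is filled from full
filled_left = s.equals(full.fillna(s))      # full has no NaN, s keeps its NaN
print(filled_right, filled_left)            # True False
```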

Performance

Now let's take a quick look at which method returns the answer fastest:

%%timeit
equality_filling_one_sided(ser1, ser2)

    910 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%%timeit
equality_filling_dummy_data(ser1, ser2)

    1.27 ms ± 67.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%%timeit
equality_boolean_mask(ser1, ser2)

    2.15 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
equality_pure_boolean(ser1, ser2)

    1.34 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

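As you can see, the boolean solutions have to compute more intermediate results separately and are therefore slower, even though they are closer to what you would write if you implemented this as optimized C code.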
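For illustration only, this is roughly the single-pass logic I have in mind when I say "what you would write in C": one loop, no intermediate arrays, and an early exit on the first mismatch. `equality_single_pass` is a hypothetical helper, not part of pandas, and as a pure-Python loop it is not meant to compete with the vectorized versions above.

```python
import math

def equality_single_pass(a, b):
    """Hypothetical helper: compare two float sequences, skipping any NaN pair."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        # Ignore the position if either side is missing.
        if math.isnan(x) or math.isnan(y):
            continue
        # Stop at the first real difference.
        if x != y:
            return False
    return True

nan = float('nan')
print(equality_single_pass([1.0, nan, 3.0], [1.0, 2.0, nan]))  # True
print(equality_single_pass([1.0, 2.0], [1.0, 3.0]))            # False
```

It compares by position and ignores any index, so for pandas data I would pass the underlying values, e.g. `ser1.to_numpy()` and `ser2.to_numpy()`.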

If you know that only the right-hand column/Series contains NaNs, you can use the equality_filling_one_sided solution for the best performance.

The ideal commutative solution

So, if we want a commutative comparison that ignores NaNs on both the left and the right side, the fastest way is to use:

def equality_filling_two_sided(a, b):
    f_a = a.fillna(b)
    f_b = b.fillna(a)
    return f_a.equals(f_b)

>>> tests(equality_filling_two_sided)
PASS


%%timeit
equality_filling_two_sided(ser1, ser2)

    962 µs ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This is only negligibly slower than the one-sided solution, but it satisfies all the requirements.
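Applied to the DataFrame from the question, the two-sided version could be used as sketched below. The second frame, `df_diff`, is made up to show a genuine mismatch. As far as I know, `Series.equals` also compares dtypes, so this assumes both columns share a dtype (float64 here) and an aligned index.

```python
import pandas as pd

def equality_filling_two_sided(a, b):
    # Fill each side's NaNs from the other side, then compare.
    return a.fillna(b).equals(b.fillna(a))

# Data from the question: col2 equals col1 except for missing values.
df = pd.DataFrame({'col1': [1., 2, 3, 4, 5, 6],
                   'col2': [1, 2, None, 4, 5, None]})

result_lr = equality_filling_two_sided(df['col1'], df['col2'])
result_rl = equality_filling_two_sided(df['col2'], df['col1'])
print(result_lr, result_rl)  # True True

# Hypothetical counter-example: the last value really differs (7 vs 6).
df_diff = pd.DataFrame({'col1': [1., 2, 3, 4, 5, 6],
                        'col2': [1, 2, None, 4, 5, 7]})
result_diff = equality_filling_two_sided(df_diff['col1'], df_diff['col2'])
print(result_diff)  # False
```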