Compare pandas columns containing NaN or <NA> values

Question

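I have a DataFrame that contains both NaN and regular values. I want to compare two columns of that DataFrame row by row, whether or not the values in a row are null. For example:

If column a_1 is null and column a_2 is not null, then the result for that row in the new column a_12 should be 1.
If a_1 (value 123) and a_2 (value 345) are both not null and the values are not equal, then the result in a_12 should be 3.

Below is the code snippet I use for the comparison. For scenario 1, I get 3 instead of 1. Please guide me to the correct output.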

    try:
        if (x[cols[0]]==x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):   
            return 0

        elif (np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
            return 0

        elif (~np.isnan(x[cols[0]])) & (np.isnan(x[cols[1]])):
            return 1

        elif (np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
            return 2

        elif (x[cols[0]]!=x[cols[1]]) & (~np.isnan(x[cols[0]])) & (~np.isnan(x[cols[1]])):
            return 3
        else:
            pass

    except Exception as exc:
        if (x[cols[0]]==x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):   
            return 0

        elif (pd.isna(x[cols[0]])) & (pd.isna(x[cols[1]])):
            return 0

        elif (pd.notna(x[cols[0]])) & (pd.isna(x[cols[1]])):
            return 1

        elif (pd.isna(x[cols[0]])) & (pd.notna(x[cols[1]])):
            return 2

        elif (x[cols[0]]!=x[cols[1]]) & (pd.notna(x[cols[0]])) & (pd.notna(x[cols[1]])):
            return 3
        else:
            pass

I used both pd.isna() / pd.notna() and np.isnan() / ~np.isnan(), because the second approach (np.isnan()) works for some columns and simply throws an error for others.
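The difference can be reproduced in isolation. The sketch below assumes the failing columns hold strings; `safe_isnan` is a hypothetical helper written only for this illustration, not part of NumPy or pandas.

```python
import numpy as np
import pandas as pd


def safe_isnan(value):
    """Hypothetical helper: call np.isnan, return None if the type is unsupported."""
    try:
        return bool(np.isnan(value))
    except TypeError:
        # np.isnan is a numeric ufunc and rejects str inputs
        return None


print(safe_isnan(1.5))           # numeric input is accepted
print(safe_isnan(float("nan")))  # numeric NaN is accepted
print(safe_isnan("fsfsfw"))      # str input raises TypeError inside the helper

# pd.isna accepts numbers, strings and pd.NA alike
print(pd.isna("fsfsfw"), pd.isna(float("nan")), pd.isna(pd.NA))
```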

Please guide me on how to get the expected result.

Expected output:

| a_1       | a_2     | result |
|-----------|---------|--------|
| gssfwe    | gssfwe  |   0    |
| <NA>      | <NA>    |   0    |
| fsfsfw    | <NA>    |   1    |
| <NA>      | qweweqw |   2    |
| adsadgsgd | wwuwquq |   3    |

Output from the code above:

| a_1       | a_2     | result |
|-----------|---------|--------|
| gssfwe    | gssfwe  |   0    |
| <NA>      | <NA>    |   0    |
| fsfsfw    | <NA>    |   3    |
| <NA>      | qweweqw |   3    |
| adsadgsgd | wwuwquq |   3    |

Answer 1

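Following the logic in your code, you need to define a function and apply it to your DataFrame.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a_1': [1, 2, np.nan, np.nan, 1], 'a_2': [2, np.nan, 1, np.nan, 1]})

The categories you want map neatly onto binary numbers, which you can use to write a short function like this: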

    def nan_check(row):
        x, y = row
        if x != y:
            return int(f'{int(pd.notna(y))}{int(pd.notna(x))}', base=2)
        return 0

    df['flag'] = df.apply(nan_check, axis=1)

Output

       a_1  a_2  flag
    0  1.0  2.0     3
    1  2.0  NaN     1
    2  NaN  1.0     2
    3  NaN  NaN     0
    4  1.0  1.0     0
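One caveat with comparing via `!=` first: on float NaN the comparison is simply True, but with `pd.NA` it returns `pd.NA`, and using that in an `if` raises a TypeError. A possible variant, sketched here under the assumption that the columns are string columns showing `<NA>` as in the question, checks for missing values before comparing. The name `nan_flag` and the sample data are illustrative only.

```python
import pandas as pd

# Sample data modeled on the question's table (assumed string dtype)
df = pd.DataFrame(
    {
        "a_1": ["gssfwe", pd.NA, "fsfsfw", pd.NA, "adsadgsgd"],
        "a_2": ["gssfwe", pd.NA, pd.NA, "qweweqw", "wwuwquq"],
    },
    dtype="string",
)


def nan_flag(row):
    x, y = row["a_1"], row["a_2"]
    x_ok, y_ok = pd.notna(x), pd.notna(y)
    if x_ok and y_ok:
        # Both present: compare the actual values
        return 0 if x == y else 3
    # At most one present: bit 0 marks a_1, bit 1 marks a_2
    return int(x_ok) + 2 * int(y_ok)


df["result"] = df.apply(nan_flag, axis=1)
print(df)
```

Because the missing-value test runs before any `==`, the function never evaluates a comparison that involves `pd.NA`.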

Answer 2

You can try np.select, but I think you need to reconsider the conditions and the expected output.

Condition 1: if column a_1 has a null value and column a_2 has a non-null value, then for that particular row the result should be 1 in the new column a_12.

Condition 2: if the values in both a_1 and a_2 are not null and the values are not equal, then the result should be 3 in column a_12.

    df['a_12'] = np.select(
        [df['a_1'].isna() & df['a_2'].notna(),
         df['a_1'].notna() & df['a_2'].notna() & df['a_1'].ne(df['a_2'])],
        [1, 3],
        default=0
    )

    print(df)

             a_1      a_2  result  a_12
    0     gssfwe   gssfwe       0     0
    1        NaN      NaN       0     0
    2     fsfsfw      NaN       1     0   # Shouldn't be Condition 1 since a_1 is not NaN
    3        NaN  qweweqw       2     1   # Condition 1
    4  adsadgsgd  wwuwquq       3     3
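If the expected-output table in the question is taken as the specification (1 when only a_1 is present, 2 when only a_2 is present, 3 when both are present but different), the same np.select approach can be extended to cover all cases. This is a sketch under that assumption; the sample data and the intermediate names (`a1_na`, `a2_na`, `both`, `differ`) are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question's table (assumed string dtype)
df = pd.DataFrame(
    {
        "a_1": ["gssfwe", pd.NA, "fsfsfw", pd.NA, "adsadgsgd"],
        "a_2": ["gssfwe", pd.NA, pd.NA, "qweweqw", "wwuwquq"],
    },
    dtype="string",
)

a1_na = df["a_1"].isna()
a2_na = df["a_2"].isna()
both = ~a1_na & ~a2_na

# .ne() can yield <NA> on nullable dtypes; treat those as "not different"
differ = both & df["a_1"].ne(df["a_2"]).fillna(False).astype(bool)

df["a_12"] = np.select(
    [~a1_na & a2_na,   # only a_1 present
     a1_na & ~a2_na,   # only a_2 present
     differ],          # both present, values differ
    [1, 2, 3],
    default=0,         # both missing, or both present and equal
)
print(df)
```

The three conditions are mutually exclusive, so their order in the list does not affect the result.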
