How can I avoid dropping rows with NaNs when using Pandas `where` method?

I've run into a problem using Pandas' `where` method. Specifically, I use `where` to identify rows in a dataframe that satisfy certain conditions. If those conditions are met, the `where` method correctly assigns NaN to the values. The problem I'm running into is the case where some rows already contain NaN values before the `where` method is executed. Instead of preserving those values intact, the rows are dropped, and my dataframe is changed in an undesirable and unexpected way. How can I correct this?
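(For context, a minimal illustration of `where`'s masking semantics on a toy frame, independent of the data below — it keeps values where the condition is True and replaces the rest with NaN, without changing the number of rows by itself:)

```python
import numpy as np
import pandas as pd

# DataFrame.where keeps values where the condition holds and masks
# the rest to NaN; the index (row count) is unchanged.
t = pd.DataFrame({'a': [1, 2, 3]})
print(t.where(t['a'] > 1))
#      a
# 0  NaN
# 1  2.0
# 2  3.0
```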

import numpy as np
import pandas as pd

High = {'High': np.array([126.93000031, 126.98999786, 124.91999817, 127.72000122,
       128.        , 127.94000244, 128.32000732, 127.38999939,
       127.63999939, 125.80000305, 125.34999847, 125.23999786,
       124.84999847, 126.16000366, 126.31999969, 128.46000671,
       127.75      ])}
Low = {'Low': np.array([125.16999817, 124.77999878, 122.86000061, 125.09999847,
       125.20999908, 125.94000244, 126.31999969, 126.41999817,
       125.08000183, 124.55000305, 123.94000244, 124.05000305,
       123.12999725, 123.84999847, 124.83000183, 126.20999908,
       126.51999664])}       
Close = {'Close': np.array([126.26999664, 124.84999847, 124.69000244, 127.30999756,
       125.43000031, 127.09999847, 126.90000153, 126.84999847,
       125.27999878, 124.61000061, 124.27999878, 125.05999756,
       123.54000092, 125.88999939, 125.90000153, 126.73999786,
       127.12999725])}        
index = pd.date_range(start = '2021-05-17', periods = 17)
df = pd.DataFrame(dict(High, **Low, **Close), index = index)

pos = {'pos': np.array([np.nan, np.nan,  1.,  1., -1., -1., -1., -1., -1., -1., -1.,  1.,  1.,
        1.,  1.,  1.,  1.])}
stop = {'stop': np.array([         np.nan,          np.nan, 122.86000061, 122.86000061,
       128.        , 128.        , 128.        , 128.        ,
       128.        , 125.80000305, 125.80000305, 124.05000305,
       124.05000305, 123.84999847, 123.84999847, 123.84999847,
       123.84999847])}
s = pd.DataFrame(dict(pos, **stop), index = index)

grouped = s.groupby(['pos','stop'])
grouped1 = grouped.apply(
    lambda g: g.where(
        (s['pos'] == 1) & (s['stop'] <= df['Low']) |
        (s['pos'] == -1) & (s['stop'] >= df['High'])
    ))

s

            pos        stop
2021-05-17  NaN         NaN
2021-05-18  NaN         NaN
2021-05-19  1.0  122.860001
2021-05-20  1.0  122.860001
2021-05-21 -1.0  128.000000
2021-05-22 -1.0  128.000000
2021-05-23 -1.0  128.000000
2021-05-24 -1.0  128.000000
2021-05-25 -1.0  128.000000
2021-05-26 -1.0  125.800003
2021-05-27 -1.0  125.800003
2021-05-28  1.0  124.050003
2021-05-29  1.0  124.050003
2021-05-30  1.0  123.849998
2021-05-31  1.0  123.849998
2021-06-01  1.0  123.849998
2021-06-02  1.0  123.849998

grouped1

            pos        stop
2021-05-19  1.0  122.860001
2021-05-20  1.0  122.860001
2021-05-21 -1.0  128.000000
2021-05-22 -1.0  128.000000
2021-05-23  NaN         NaN
2021-05-24 -1.0  128.000000
2021-05-25 -1.0  128.000000
2021-05-26 -1.0  125.800003
2021-05-27 -1.0  125.800003
2021-05-28  1.0  124.050003
2021-05-29  NaN         NaN
2021-05-30  1.0  123.849998
2021-05-31  1.0  123.849998
2021-06-01  1.0  123.849998
2021-06-02  1.0  123.849998

The problem is that the grouped1 dataframe is now missing the first two rows of the s dataframe, associated with the 2021-05-17 and 2021-05-18 index entries. Am I misunderstanding something about the `where` method, or is this a bug? What is the best alternative approach to produce the desired result below?

grouped1
            pos        stop
2021-05-17  NaN         NaN
2021-05-18  NaN         NaN
2021-05-19  1.0  122.860001
2021-05-20  1.0  122.860001
2021-05-21 -1.0  128.000000
2021-05-22 -1.0  128.000000
2021-05-23  NaN         NaN
2021-05-24 -1.0  128.000000
2021-05-25 -1.0  128.000000
2021-05-26 -1.0  125.800003
2021-05-27 -1.0  125.800003
2021-05-28  1.0  124.050003
2021-05-29  NaN         NaN
2021-05-30  1.0  123.849998
2021-05-31  1.0  123.849998
2021-06-01  1.0  123.849998
2021-06-02  1.0  123.849998

One workaround is to fill your NaNs with a value you will never otherwise get, e.g. -999. Those rows are then guaranteed not to satisfy your conditions in `where`, and will be filled with NaN in your resulting grouped1 DataFrame:
grouped = s.fillna(-999).groupby(['pos','stop'])
grouped1 = grouped.apply(
    lambda g: g.where(
        (s['pos'] == 1) & (s['stop'] <= df['Low']) |
        (s['pos'] == -1) & (s['stop'] >= df['High'])
    ))

Result:

>>> grouped1
            pos        stop
2021-05-17  NaN         NaN
2021-05-18  NaN         NaN
2021-05-19  1.0  122.860001
2021-05-20  1.0  122.860001
2021-05-21 -1.0  128.000000
2021-05-22 -1.0  128.000000
2021-05-23  NaN         NaN
2021-05-24 -1.0  128.000000
2021-05-25 -1.0  128.000000
2021-05-26 -1.0  125.800003
2021-05-27 -1.0  125.800003
2021-05-28  1.0  124.050003
2021-05-29  NaN         NaN
2021-05-30  1.0  123.849998
2021-05-31  1.0  123.849998
2021-06-01  1.0  123.849998
2021-06-02  1.0  123.849998
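Alternatively, since the mask is built entirely from `s` and `df` rather than from each group, the same masking can be done without `groupby` at all: calling `where` on the whole frame preserves every row of the index, and the already-NaN rows simply come out as NaN because NaN comparisons evaluate to False. A sketch on a 5-row subset of the data above (assuming the grouping isn't needed for anything else):

```python
import numpy as np
import pandas as pd

# Small subset of the question's data, enough to show the behavior.
idx = pd.date_range('2021-05-17', periods=5)
df = pd.DataFrame({'High': [126.93, 126.99, 124.92, 127.72, 128.0],
                   'Low':  [125.17, 124.78, 122.86, 125.10, 125.21]},
                  index=idx)
s = pd.DataFrame({'pos':  [np.nan, np.nan, 1.0, 1.0, -1.0],
                  'stop': [np.nan, np.nan, 122.86, 122.86, 128.0]},
                 index=idx)

# where() on the whole frame keeps every row of the index; rows where
# the condition is False (including the all-NaN rows, since comparisons
# against NaN are False) are masked to NaN rather than dropped.
cond = ((s['pos'] == 1) & (s['stop'] <= df['Low'])) | \
       ((s['pos'] == -1) & (s['stop'] >= df['High']))
result = s.where(cond)
print(result)
```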

For those interested: initially I thought s.groupby(['pos','stop'], dropna=False) should handle the NaNs, because

for _,df_group in s.groupby(['pos','stop']): print(df_group)

shows all the groups, including the NaN one. However, once you add .apply, any rows with NaNs are dropped again. For example, none of the NaN rows show up when you run:
s.groupby(['pos','stop'], dropna=False).apply(lambda g: g)

I would have expected this to return all rows, including those with NaNs. I suspect this is because np.nan != np.nan, so the NaNs somehow get discarded when we use .apply.
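The dropping behavior can be reproduced on a tiny frame. A sketch with the default dropna=True, where a NaN group key means the row belongs to no group at all and so vanishes from anything built by iterating or applying over groups (the interaction of dropna=False with .apply has varied across pandas versions):

```python
import numpy as np
import pandas as pd

# One row has a NaN group key.
t = pd.DataFrame({'key': [1.0, np.nan, 2.0], 'val': [10, 20, 30]})

# Default dropna=True: the NaN key is excluded from the groups entirely.
keys_default = [k for k, _ in t.groupby('key')]
# dropna=False: a NaN group is kept when iterating.
keys_kept = [k for k, _ in t.groupby('key', dropna=False)]

# apply() only sees the groups, so with the default the NaN-keyed row
# is missing from the concatenated result.
out = t.groupby('key').apply(lambda g: g)
print(out)
```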