How can I avoid dropping rows with NaNs when using Pandas `where` method?
I have run into a problem using the Pandas `where` method. Specifically, I use `where` to identify rows in a dataframe that satisfy certain conditions; when those conditions are met, the `where` method correctly assigns NaN to those values. The problem I have is with rows that already contain NaN values before `where` is executed. Instead of preserving those values intact, it drops the rows, and my dataframe is altered in an undesirable and unexpected way. How can I correct this?
import numpy as np
import pandas as pd
High = {'High': np.array([126.93000031, 126.98999786, 124.91999817, 127.72000122,
128. , 127.94000244, 128.32000732, 127.38999939,
127.63999939, 125.80000305, 125.34999847, 125.23999786,
124.84999847, 126.16000366, 126.31999969, 128.46000671,
127.75 ])}
Low = {'Low': np.array([125.16999817, 124.77999878, 122.86000061, 125.09999847,
125.20999908, 125.94000244, 126.31999969, 126.41999817,
125.08000183, 124.55000305, 123.94000244, 124.05000305,
123.12999725, 123.84999847, 124.83000183, 126.20999908,
126.51999664])}
Close = {'Close': np.array([126.26999664, 124.84999847, 124.69000244, 127.30999756,
125.43000031, 127.09999847, 126.90000153, 126.84999847,
125.27999878, 124.61000061, 124.27999878, 125.05999756,
123.54000092, 125.88999939, 125.90000153, 126.73999786,
127.12999725])}
index = pd.date_range(start = '2021-05-17', periods = 17)
df = pd.DataFrame(dict(High, **Low, **Close), index = index)
pos = {'pos': np.array([np.nan, np.nan, 1., 1., -1., -1., -1., -1., -1., -1., -1., 1., 1.,
1., 1., 1., 1.])}
stop = {'stop': np.array([ np.nan, np.nan, 122.86000061, 122.86000061,
128. , 128. , 128. , 128. ,
128. , 125.80000305, 125.80000305, 124.05000305,
124.05000305, 123.84999847, 123.84999847, 123.84999847,
123.84999847])}
s = pd.DataFrame(dict(pos, **stop), index = index)
grouped = s.groupby(['pos', 'stop'])
grouped1 = grouped.apply(
    lambda g: g.where(
        (s['pos'] == 1) & (s['stop'] <= df['Low']) |
        (s['pos'] == -1) & (s['stop'] >= df['High'])
    ))
s
pos stop
2021-05-17 NaN NaN
2021-05-18 NaN NaN
2021-05-19 1.0 122.860001
2021-05-20 1.0 122.860001
2021-05-21 -1.0 128.000000
2021-05-22 -1.0 128.000000
2021-05-23 -1.0 128.000000
2021-05-24 -1.0 128.000000
2021-05-25 -1.0 128.000000
2021-05-26 -1.0 125.800003
2021-05-27 -1.0 125.800003
2021-05-28 1.0 124.050003
2021-05-29 1.0 124.050003
2021-05-30 1.0 123.849998
2021-05-31 1.0 123.849998
2021-06-01 1.0 123.849998
2021-06-02 1.0 123.849998
grouped1
pos stop
2021-05-19 1.0 122.860001
2021-05-20 1.0 122.860001
2021-05-21 -1.0 128.000000
2021-05-22 -1.0 128.000000
2021-05-23 NaN NaN
2021-05-24 -1.0 128.000000
2021-05-25 -1.0 128.000000
2021-05-26 -1.0 125.800003
2021-05-27 -1.0 125.800003
2021-05-28 1.0 124.050003
2021-05-29 NaN NaN
2021-05-30 1.0 123.849998
2021-05-31 1.0 123.849998
2021-06-01 1.0 123.849998
2021-06-02 1.0 123.849998
The problem is that the grouped1 dataframe is now missing the first two rows of the s dataframe, associated with the 2021-05-17 and 2021-05-18 index entries. Am I misunderstanding the `where` method, or is this a bug? What is the best alternative approach to produce the desired result below?
grouped1
pos stop
2021-05-17 NaN NaN
2021-05-18 NaN NaN
2021-05-19 1.0 122.860001
2021-05-20 1.0 122.860001
2021-05-21 -1.0 128.000000
2021-05-22 -1.0 128.000000
2021-05-23 NaN NaN
2021-05-24 -1.0 128.000000
2021-05-25 -1.0 128.000000
2021-05-26 -1.0 125.800003
2021-05-27 -1.0 125.800003
2021-05-28 1.0 124.050003
2021-05-29 NaN NaN
2021-05-30 1.0 123.849998
2021-05-31 1.0 123.849998
2021-06-01 1.0 123.849998
2021-06-02 1.0 123.849998
A workaround is to fill your NaNs with a value you can never actually get, e.g. -999. Those rows will then definitely not satisfy your conditions in `where`, and will end up filled with NaN in your resulting grouped1 DataFrame:
grouped = s.fillna(-999).groupby(['pos', 'stop'])
grouped1 = grouped.apply(
    lambda g: g.where(
        (s['pos'] == 1) & (s['stop'] <= df['Low']) |
        (s['pos'] == -1) & (s['stop'] >= df['High'])
    ))
Result:
>>> grouped1
pos stop
2021-05-17 NaN NaN
2021-05-18 NaN NaN
2021-05-19 1.0 122.860001
2021-05-20 1.0 122.860001
2021-05-21 -1.0 128.000000
2021-05-22 -1.0 128.000000
2021-05-23 NaN NaN
2021-05-24 -1.0 128.000000
2021-05-25 -1.0 128.000000
2021-05-26 -1.0 125.800003
2021-05-27 -1.0 125.800003
2021-05-28 1.0 124.050003
2021-05-29 NaN NaN
2021-05-30 1.0 123.849998
2021-05-31 1.0 123.849998
2021-06-01 1.0 123.849998
2021-06-02 1.0 123.849998
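As a side note, the mask inside the lambda is built from s and df directly, not from the group g, so the groupby/apply layer is arguably doing no work here; applying `where` to s on its own preserves the full index, NaN rows included, because NaN comparisons evaluate to False. A minimal sketch (the question's frames abbreviated to their first 7 rows for brevity):

```python
import numpy as np
import pandas as pd

# Abbreviated versions of the question's df and s (first 7 rows only).
index = pd.date_range(start='2021-05-17', periods=7)
df = pd.DataFrame({
    'High': [126.93, 126.99, 124.92, 127.72, 128.00, 127.94, 128.32],
    'Low':  [125.17, 124.78, 122.86, 125.10, 125.21, 125.94, 126.32],
}, index=index)
s = pd.DataFrame({
    'pos':  [np.nan, np.nan, 1.0, 1.0, -1.0, -1.0, -1.0],
    'stop': [np.nan, np.nan, 122.86, 122.86, 128.00, 128.00, 128.00],
}, index=index)

# Rows whose pos/stop are NaN fail both comparisons (NaN compares False),
# so `where` masks them to NaN rather than dropping them, and the original
# index survives untouched.
cond = ((s['pos'] == 1) & (s['stop'] <= df['Low']) |
        (s['pos'] == -1) & (s['stop'] >= df['High']))
result = s.where(cond)
```

This keeps the 2021-05-17 and 2021-05-18 rows as NaN/NaN instead of dropping them.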
For those interested: initially I thought s.groupby(['pos','stop'], dropna=False) should handle the NaNs, since
for _, df_group in s.groupby(['pos','stop'], dropna=False): print(df_group)
prints all of the groups, including the NaN-keyed one. However, once you add .apply, any rows with NaNs are dropped again. For example, none of the NaN rows show up when you run:
s.groupby(['pos','stop'], dropna=False).apply(lambda g: g)
I expected this to return all rows, including the ones with NaNs. I guess this may be because np.nan != np.nan, so the NaNs somehow get dropped when we use .apply.
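The observation above can be reproduced with a tiny sketch (toy data, not the question's frames); note that whether .apply keeps the NaN-keyed rows has varied across pandas versions:

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN-keyed row.
s = pd.DataFrame({'pos': [np.nan, 1.0, 1.0],
                  'stop': [np.nan, 10.0, 10.0]})

grouped = s.groupby(['pos', 'stop'], dropna=False)

# Iterating over the groups shows every row, the NaN-keyed one included.
rows_seen = sum(len(g) for _, g in grouped)  # 3

# Whether .apply keeps the NaN-keyed row depends on the pandas version; in
# the version this discussion was written against, it was dropped again.
applied = grouped.apply(lambda g: g)
```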