How can I change a pandas DataFrame so that the values are strictly between 0 and 1 while preserving the NaNs?

Question

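I am working on the Kaggle Titanic problem. I have a function that builds cross-tabulations of survival rates by passenger features. For SibSp by Embarked, I get the following table of survival rates: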

import pandas as pd
import numpy as np

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
data = [[0.5, 0.657, 0.75, np.nan, np.nan, np.nan, np.nan, 0.556],
        [0.372, 0.375, 0.667, np.nan, 0, np.nan, np.nan, 0.362],
        [0.302, 0.438, 0.375, 0.364, 0.3, 0, 0, 0.336],
        [0.343, 0.506, 0.478, 0.364, 0.214, 0, 0, 0.377]]
df_m = pd.DataFrame(data, columns=[0, 1, 2, 3, 4, 5, 8, 'All'],
                    index=['C', 'Q', 'A', 'All'])

So the transpose I start from is:

Embarked         C         Q         S       All
SibSp                                           
0         0.500000  0.372093  0.302115  0.342920
1         0.657143  0.375000  0.468468  0.506494
2         0.750000  0.666667  0.375000  0.478261
3              NaN       NaN  0.363636  0.363636
4              NaN  0.000000  0.300000  0.214286
5              NaN       NaN  0.000000  0.000000
8              NaN       NaN  0.000000  0.000000
All       0.555556  0.362069  0.336049  0.376877

whereas the end result I want is:

Embarked         C         Q         S       All
SibSp                                           
0         0.500000  0.372093  0.302115  0.342920
1         0.657143  0.375000  0.468468  0.506494
2         0.750000  0.666667  0.375000  0.478261
3              NaN       NaN  0.363636  0.363636
4              NaN  0.000100  0.300000  0.214286
5              NaN       NaN  0.000100  0.000100
8              NaN       NaN  0.000100  0.000100
All       0.555556  0.362069  0.336049  0.376877

I want to constrain the rates to lie strictly between 0 and 1 while preserving the NaNs. I have tried looping in two ways:

for i in df_m.columns:
    for j in df_m.index:
        p_hat.at[i, j] = max(min(df_m[i, j], 0.999), 0.001)

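and the same loop with ".at" replaced by ".loc" in the last line. Both versions throw KeyError: (0, 'C'), i.e. on the first column label and first index label.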
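For readers who hit the same error, a likely cause is that `df_m[i, j]` asks for a single column labelled by the tuple `(i, j)`, and that `.at`/`.loc` expect the row label first and the column label second. The following is only a sketch of a loop that avoids both problems, assuming the `df_m` defined above; creating `p_hat` as a copy of `df_m` is an assumption, since the post does not show how `p_hat` was built.

```python
import numpy as np
import pandas as pd

data = [[0.5, 0.657, 0.75, np.nan, np.nan, np.nan, np.nan, 0.556],
        [0.372, 0.375, 0.667, np.nan, 0, np.nan, np.nan, 0.362],
        [0.302, 0.438, 0.375, 0.364, 0.3, 0, 0, 0.336],
        [0.343, 0.506, 0.478, 0.364, 0.214, 0, 0, 0.377]]
df_m = pd.DataFrame(data, columns=[0, 1, 2, 3, 4, 5, 8, 'All'],
                    index=['C', 'Q', 'A', 'All'])

# Assumption: p_hat starts as a copy so untouched cells (including NaN) carry over.
p_hat = df_m.copy()

for col in df_m.columns:
    for row in df_m.index:
        # .at takes (row label, column label), in that order.
        value = df_m.at[row, col]
        # Skip missing cells explicitly so NaN is left as it is.
        if pd.notna(value):
            p_hat.at[row, col] = max(min(value, 0.999), 0.001)
```

An element-wise loop like this is slow on large frames, so it is mainly useful for understanding the indexing error rather than as a recommended solution.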

Another approach I tried was to concatenate and take max(value, .001) and min(value, .999):

smalls = pd.DataFrame(0.001*np.ones(df_m.shape)) 
bigs   = pd.DataFrame(0.999*np.ones(df_m.shape)) 
smalls.columns = df_m.columns
bigs.columns = df_m.columns
smalls.index = df_m.index
bigs.index = df_m.index
p_hat1 = pd.concat([df_m, bigs]).groupby(level=0).min()
p_hat  = pd.concat([p_hat1, smalls]).groupby(level=0).max()

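This has the side effect of converting the NaNs to 0.999. In a later step I want to combine the rates and counts and compute 95% confidence intervals for plotting. At that stage I do not want the NaN cells to be displayed.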

Thanks in advance.

Answer 1

Try:

df_m[df_m.eq(0)] = 0.0001
print(df_m.T)

# Output
         C       Q       A     All
0    0.500  0.3720  0.3020  0.3430
1    0.657  0.3750  0.4380  0.5060
2    0.750  0.6670  0.3750  0.4780
3      NaN     NaN  0.3640  0.3640
4      NaN  0.0001  0.3000  0.2140
5      NaN     NaN  0.0001  0.0001
8      NaN     NaN  0.0001  0.0001
All  0.556  0.3620  0.3360  0.3770

Update

It doesn't show in this example but I also replace values of 1.0 with 0.999
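The boolean-mask idea can be extended to the upper endpoint as well. The sketch below uses `DataFrame.mask` on a small made-up frame (`rates` and its values are hypothetical, not the Titanic data) and the replacement values 0.0001 and 0.999 mentioned in this thread. Comparisons against NaN are False, so missing cells are not touched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing an exact 0, an exact 1, and NaNs.
rates = pd.DataFrame({'a': [0.0, 0.5, np.nan],
                      'b': [1.0, np.nan, 0.25]})

# Replace exact zeros and exact ones; every other cell, NaN included, is kept.
adjusted = (rates
            .mask(rates.eq(0), 0.0001)
            .mask(rates.eq(1), 0.999))
```

Unlike an in-place assignment, this returns a new frame and leaves `rates` unchanged.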

In that case, prefer clip:

df_m = df_m.clip(lower=0.001, upper=0.999)
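As a quick check that `clip` leaves missing values alone, here is a minimal self-contained sketch using the frame from the question. The name `clipped` is introduced only for illustration; the bounds are the ones used in the line above.

```python
import numpy as np
import pandas as pd

data = [[0.5, 0.657, 0.75, np.nan, np.nan, np.nan, np.nan, 0.556],
        [0.372, 0.375, 0.667, np.nan, 0, np.nan, np.nan, 0.362],
        [0.302, 0.438, 0.375, 0.364, 0.3, 0, 0, 0.336],
        [0.343, 0.506, 0.478, 0.364, 0.214, 0, 0, 0.377]]
df_m = pd.DataFrame(data, columns=[0, 1, 2, 3, 4, 5, 8, 'All'],
                    index=['C', 'Q', 'A', 'All'])

# Values below 0.001 are raised to 0.001 and values above 0.999 are lowered to 0.999.
clipped = df_m.clip(lower=0.001, upper=0.999)

# The NaN pattern should be identical before and after clipping.
same_nan_pattern = clipped.isna().equals(df_m.isna())
print(same_nan_pattern)
print(clipped.T)
```

Note that `clip` moves every value outside the bounds, not only exact 0 and 1, so a rate such as 0.0005 would also become 0.001.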

dataframe

python-3.x

pandas

pandas-groupby