Fill NaN values in a DataFrame with a conditional average

I'm new to pandas. I downloaded some public COVID data, and now I'm trying to fill in the NaN values. My dataframe looks like this:

Now I'm trying to fill in the NaN values in the new_cases column. At first I used:

# Replace the NaNs with the column mean.
# (Don't combine inplace=True with assignment: fillna then returns None.)
df['new_cases'] = df['new_cases'].fillna(df['new_cases'].mean().astype(int))

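However, the data covers different countries and weeks, so on reflection the overall mean isn't really informative. What I'd like to do is take the average of the previous and the next week for the same country. That would make more sense, but I don't know how to do it.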

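I think this is what you're looking for. As you said, you want to replace each NaN with the average of the previous and next new_cases values, by country and week. So your data looks like this: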

index  country country_code year_week  source  new_cases  number_sequenced  \
0      0  Austria           AT   2020-40  GISAID     5152.0                 4   
1      1  Austria           AT   2020-40  GISAID     5152.0                 4   
2      2  Austria           AT   2020-40  GISAID     5152.0                 4   
3      3  Austria           AT   2020-40  GISAID     5152.0                 4   
4      4  Austria           AT   2020-40  GISAID     5152.0                 4   
5      5  Austria           AT   2020-40  GISAID     5152.0                 4   
6      6  Austria           AT   2020-40  GISAID     5152.0                 4   
7      7  Austria           AT   2020-40  GISAID     5152.0                 4   
8      8  Austria           AT   2020-40  GISAID     5152.0                 4   
9      9  Austria           AT   2020-40  GISAID     5152.0                 4   

   percent_cases_sequenced valid_denominator          variant  \
0                      0.1               Yes          B.1.1.7   
1                      0.1               Yes    B.1.1.7+E484K   
2                      0.1               Yes          B.1.351   
3                      0.1               Yes  B.1.427/B.1.429   
4                      0.1               Yes          B.1.525   
5                      0.1               Yes          B.1.526   
6                      0.1               Yes          B.1.616   
7                      0.1               Yes          B.1.617   
8                      0.1               Yes        B.1.617.1   
9                      0.1               Yes        B.1.617.2   

   number_detections_variant  percent_variant  
0                          0              0.0  
1                          0              0.0  
2                          0              0.0  
3                          0              0.0  
4                          0              0.0  
5                          0              0.0  
6                          0              0.0  
7                          0              0.0  
8                          0              0.0  
9                          0              0.0  

To know what to look for, I checked where the NaNs are:

df_null = df[df['new_cases'].isna()]
df_null

which gives:

country country_code year_week  source  new_cases  number_sequenced  \
24202   Spain           ES   2021-08  GISAID        NaN              1195   
24203   Spain           ES   2021-08  GISAID        NaN              1195   
24204   Spain           ES   2021-08  GISAID        NaN              1195   
24205   Spain           ES   2021-08  GISAID        NaN              1195   
24206   Spain           ES   2021-08  GISAID        NaN              1195   
24207   Spain           ES   2021-08  GISAID        NaN              1195   
24208   Spain           ES   2021-08  GISAID        NaN              1195   
24209   Spain           ES   2021-08  GISAID        NaN              1195   
24210   Spain           ES   2021-08  GISAID        NaN              1195   
24211   Spain           ES   2021-08  GISAID        NaN              1195   
24212   Spain           ES   2021-08  GISAID        NaN              1195   
24213   Spain           ES   2021-08  GISAID        NaN              1195   
24214   Spain           ES   2021-08  GISAID        NaN              1195   
24215   Spain           ES   2021-08  GISAID        NaN              1195   
24216   Spain           ES   2021-08  GISAID        NaN              1195   
24217   Spain           ES   2021-08  GISAID        NaN              1195   
24218   Spain           ES   2021-08  GISAID        NaN              1195   
24629   Spain           ES   2021-08   TESSy        NaN               998   
24630   Spain           ES   2021-08   TESSy        NaN               998   
24631   Spain           ES   2021-08   TESSy        NaN               998   
24632   Spain           ES   2021-08   TESSy        NaN               998   
24633   Spain           ES   2021-08   TESSy        NaN               998   
24634   Spain           ES   2021-08   TESSy        NaN               998   

       percent_cases_sequenced valid_denominator          variant  \
24202                      NaN               Yes          B.1.1.7   
24203                      NaN               Yes    B.1.1.7+E484K   
24204                      NaN               Yes          B.1.351   
24205                      NaN               Yes  B.1.427/B.1.429   
24206                      NaN               Yes          B.1.525   
24207                      NaN               Yes          B.1.526   
24208                      NaN               Yes          B.1.616   
24209                      NaN               Yes          B.1.617   
24210                      NaN               Yes        B.1.617.1   
24211                      NaN               Yes        B.1.617.2   
24212                      NaN               Yes        B.1.617.3   
24213                      NaN               Yes          B.1.620   
24214                      NaN               Yes          B.1.621   
24215                      NaN               Yes             C.37   
24216                      NaN               Yes              P.1   
24217                      NaN               Yes              P.3   
24218                      NaN               Yes            Other   
24629                      NaN               Yes          B.1.1.7   
24630                      NaN               Yes          B.1.351   
24631                      NaN               Yes          B.1.525   
24632                      NaN               Yes              P.1   
24633                      NaN               Yes              UNK   
24634                      NaN               Yes            Other   

       number_detections_variant  percent_variant  
24202                        703             58.8  
24203                          0              0.0  
24204                         10              0.8  
24205                          1              0.1  
24206                          2              0.2  
24207                          0              0.0  
24208                          0              0.0  
24209                          0              0.0  
24210                          0              0.0  
24211                          0              0.0  
24212                          0              0.0  
24213                          0              0.0  
24214                          0              0.0  
24215                          0              0.0  
24216                          2              0.2  
24217                          0              0.0  
24218                        477             39.9  
24629                        682             68.3  
24630                          4              0.4  
24631                          1              0.1  
24632                          2              0.2  
24633                         65              6.5  
24634                        244             24.4 

To fill these in, you need to do the following:

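# numeric_only=True skips the text columns (required in pandas 2.0+); reset_index() turns the group keys back into columns
df2 = pd.concat([df.ffill(), df.bfill()]).groupby(['country', 'year_week']).mean(numeric_only=True).reset_index()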

For the data that needed filling, this gives:

country year_week  new_cases  number_sequenced  percent_cases_sequenced  \
1176   Spain   2020-40    65146.0        224.500000                 0.340000   
1177   Spain   2020-41    75556.0        216.800000                 0.270000   
1178   Spain   2020-42    85481.0        310.000000                 0.323810   
1179   Spain   2020-43   123871.0        195.736842                 0.178947   
1180   Spain   2020-44   142377.0        134.250000                 0.085000   
1181   Spain   2020-45   140521.0        171.200000                 0.085000   
1182   Spain   2020-46   115646.0        176.210526                 0.178947   
1183   Spain   2020-47    85752.0        170.700000                 0.170000   
1184   Spain   2020-48    65571.0        125.250000                 0.185000   
1185   Spain   2020-49    54141.0        295.105263                 0.568421   
1186   Spain   2020-50    49556.0        390.900000                 0.755000   
1187   Spain   2020-51    67365.0        485.300000                 0.755000   
1188   Spain   2020-52    60164.0        383.523810                 0.638095   
1189   Spain   2020-53    79431.0        629.150000                 0.770000   
1190   Spain   2021-01   152938.0        601.250000                 0.400000   
1191   Spain   2021-02   224669.0        966.095238                 0.400000   
1192   Spain   2021-03   256931.0       1400.714286                 0.561905   
1193   Spain   2021-04   229423.0       1358.826087                 0.547826   
1194   Spain   2021-05   166280.0       1380.952381                 0.842857   
1195   Spain   2021-06    97201.0       1136.142857                 1.142857   
1196   Spain   2021-07    67685.0       1259.333333                 1.866667   
1197   Spain   2021-08    51494.0       1143.608696                 2.617391   
1198   Spain   2021-09    35303.0       1226.318182                 3.440909   
1199   Spain   2021-10    34092.0       1236.125000                 3.612500   
1200   Spain   2021-11    33741.0       1145.000000                 3.404167   
1201   Spain   2021-12    42022.0       1450.208333                 3.445833   
1202   Spain   2021-13    40500.0       1408.583333                 3.475000   
1203   Spain   2021-14    58931.0       1326.307692                 2.234615   
1204   Spain   2021-15    58098.0       1540.391304                 2.673913   
1205   Spain   2021-16    60115.0       1168.538462                 1.938462   
1206   Spain   2021-17    51961.0       1286.925926                 2.440741   
1207   Spain   2021-18    40962.0       1326.555556                 3.229630   
1208   Spain   2021-19    34468.0       1184.653846                 3.457692   
1209   Spain   2021-20    31660.0       1053.846154                 3.319231   
1210   Spain   2021-21    30870.0       1095.269231                 3.550000   
1211   Spain   2021-22    29133.0       1071.730769                 3.650000   
1212   Spain   2021-23    34244.0        852.923077                 2.476923   
1213   Spain   2021-24    22884.0        727.461538                 3.176923   
1214   Spain   2021-25    27991.0        637.200000                 2.284000   
1215   Spain   2021-26    73833.0        475.384615                 0.607692   
1216   Spain   2021-27   104649.0         42.695652                 0.026087   
1217   Spain   2021-28   190726.0          0.300000                 0.000000   

      number_detections_variant  percent_variant  
1176                  14.400000         9.995000  
1177                  15.800000        10.000000  
1178                  19.000000         9.528571  
1179                  12.210526        10.526316  
1180                   9.050000        10.005000  
1181                  11.800000         9.995000  
1182                  10.736842        10.526316  
1183                  10.700000         9.995000  
1184                   8.850000         9.995000  
1185                  24.000000        10.526316  
1186                  33.000000        10.000000  
1187                  43.000000        10.000000  
1188                  39.238095         9.528571  
1189                  55.950000        10.000000  
1190                  58.550000         9.995000  
1191                  85.523810         9.523810  
1192                 117.571429         9.519048  
1193                 108.739130         8.700000  
1194                 117.428571         9.519048  
1195                  99.095238         9.528571  
1196                 105.904762         9.523810  
1197                  95.347826         8.691304  
1198                 104.863636         9.100000  
1199                 101.291667         8.341667  
1200                  99.583333         8.337500  
1201                 124.375000         8.329167  
1202                 122.000000         8.333333  
1203                 102.923077         7.696154  
1204                 133.739130         8.700000  
1205                  92.538462         7.696154  
1206                  91.851852         7.407407  
1207                  94.000000         7.411111  
1208                  88.346154         7.688462  
1209                  86.461538         7.692308  
1210                  88.807692         7.688462  
1211                  86.807692         7.684615  
1212                  68.615385         7.684615  
1213                  57.923077         7.692308  
1214                  55.800000         7.996000  
1215                  41.846154         7.688462  
1216                   4.565217         8.691304  
1217                   0.100000         6.445000  
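To see why stacking a forward-filled and a backward-filled copy produces the neighbor average, here is a minimal sketch on made-up data. The frame `toy` and its numbers are assumptions for illustration only, not values from the COVID dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one value per country and week.
toy = pd.DataFrame({
    "country": ["Spain", "Spain", "Spain", "Austria", "Austria", "Austria"],
    "year_week": ["2021-07", "2021-08", "2021-09", "2021-07", "2021-08", "2021-09"],
    "new_cases": [100.0, np.nan, 300.0, 10.0, 20.0, 30.0],
})

# ffill() copies the previous row's value into the gap, bfill() the next row's.
# Stacking both copies means every (country, year_week) group holds two values:
# identical ones where data existed, previous and next where it was missing.
stacked = pd.concat([toy.ffill(), toy.bfill()])

# Averaging per group therefore leaves known values unchanged and
# replaces the gap with the mean of its two neighbours.
filled = stacked.groupby(["country", "year_week"], as_index=False)["new_cases"].mean()
print(filled)
```

In this toy frame the missing Spain 2021-08 value becomes (100 + 300) / 2 = 200, while all other rows keep their original values.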

Afterwards, you can round the values:

df2['new_cases'] = df2['new_cases'].round(0)
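Note that df2 is aggregated to one row per country and week, and that `ffill()`/`bfill()` on the whole frame do not stop at country boundaries. If you would rather keep the original rows and only fill the gaps, one possible variant is sketched below. It assumes that `year_week` strings sort chronologically (zero-padded, as in the data above) and that all rows of one country and week share the same new_cases value; `df_demo` and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: several variant rows can share one (country, year_week) value.
df_demo = pd.DataFrame({
    "country": ["Austria", "Austria", "Spain", "Spain", "Spain", "Spain"],
    "year_week": ["2021-08", "2021-09", "2021-07", "2021-08", "2021-08", "2021-09"],
    "new_cases": [50.0, 70.0, 100.0, np.nan, np.nan, 300.0],
})

# One value per country and week; groupby sorts the keys, so consecutive
# entries within a country are neighbouring weeks.
weekly = df_demo.groupby(["country", "year_week"])["new_cases"].first()

# Fill forwards and backwards within each country only, then average.
by_country = weekly.groupby(level="country")
neighbour_mean = (by_country.ffill() + by_country.bfill()) / 2

# Look up the weekly value for every original row and use it only where
# new_cases is missing.
fill = df_demo.join(neighbour_mean.rename("fill"), on=["country", "year_week"])["fill"]
df_demo["new_cases"] = df_demo["new_cases"].fillna(fill)
print(df_demo)
```

With this approach a gap in a country's first or last week stays NaN, because there is no neighbor on one side; whether that is preferable to borrowing a value from another country depends on your analysis.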