Fill NaN values in a DataFrame with a conditional average
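I'm new to pandas. I downloaded some public COVID data and am now trying to fill in the NaN values. My dataframe looks like this:
Right now I'm trying to fill the NaN values in the new_cases column. At first I filled them with the mean of the column: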
# Assign the result back; combining assignment with inplace=True would store None
df['new_cases'] = df['new_cases'].fillna(df['new_cases'].mean().astype(int))
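However, the data covers different countries and weeks, so on reflection the overall mean is not really informative.
What I would like to do is take the average of the week before and the week after for the same country. That would make more sense, but I don't know how to do it.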
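I think this is what you're looking for. As you said, you want to replace each NaN, by country and week, with the mean of the previous and next new_cases values. Your data looks like this: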
index country country_code year_week source new_cases number_sequenced \
0 0 Austria AT 2020-40 GISAID 5152.0 4
1 1 Austria AT 2020-40 GISAID 5152.0 4
2 2 Austria AT 2020-40 GISAID 5152.0 4
3 3 Austria AT 2020-40 GISAID 5152.0 4
4 4 Austria AT 2020-40 GISAID 5152.0 4
5 5 Austria AT 2020-40 GISAID 5152.0 4
6 6 Austria AT 2020-40 GISAID 5152.0 4
7 7 Austria AT 2020-40 GISAID 5152.0 4
8 8 Austria AT 2020-40 GISAID 5152.0 4
9 9 Austria AT 2020-40 GISAID 5152.0 4
percent_cases_sequenced valid_denominator variant \
0 0.1 Yes B.1.1.7
1 0.1 Yes B.1.1.7+E484K
2 0.1 Yes B.1.351
3 0.1 Yes B.1.427/B.1.429
4 0.1 Yes B.1.525
5 0.1 Yes B.1.526
6 0.1 Yes B.1.616
7 0.1 Yes B.1.617
8 0.1 Yes B.1.617.1
9 0.1 Yes B.1.617.2
number_detections_variant percent_variant
0 0 0.0
1 0 0.0
2 0 0.0
3 0 0.0
4 0 0.0
5 0 0.0
6 0 0.0
7 0 0.0
8 0 0.0
9 0 0.0
To know what to look for, I checked where the NaN values are:
df_null = df[df['new_cases'].isna()]
df_null
which gives:
country country_code year_week source new_cases number_sequenced \
24202 Spain ES 2021-08 GISAID NaN 1195
24203 Spain ES 2021-08 GISAID NaN 1195
24204 Spain ES 2021-08 GISAID NaN 1195
24205 Spain ES 2021-08 GISAID NaN 1195
24206 Spain ES 2021-08 GISAID NaN 1195
24207 Spain ES 2021-08 GISAID NaN 1195
24208 Spain ES 2021-08 GISAID NaN 1195
24209 Spain ES 2021-08 GISAID NaN 1195
24210 Spain ES 2021-08 GISAID NaN 1195
24211 Spain ES 2021-08 GISAID NaN 1195
24212 Spain ES 2021-08 GISAID NaN 1195
24213 Spain ES 2021-08 GISAID NaN 1195
24214 Spain ES 2021-08 GISAID NaN 1195
24215 Spain ES 2021-08 GISAID NaN 1195
24216 Spain ES 2021-08 GISAID NaN 1195
24217 Spain ES 2021-08 GISAID NaN 1195
24218 Spain ES 2021-08 GISAID NaN 1195
24629 Spain ES 2021-08 TESSy NaN 998
24630 Spain ES 2021-08 TESSy NaN 998
24631 Spain ES 2021-08 TESSy NaN 998
24632 Spain ES 2021-08 TESSy NaN 998
24633 Spain ES 2021-08 TESSy NaN 998
24634 Spain ES 2021-08 TESSy NaN 998
percent_cases_sequenced valid_denominator variant \
24202 NaN Yes B.1.1.7
24203 NaN Yes B.1.1.7+E484K
24204 NaN Yes B.1.351
24205 NaN Yes B.1.427/B.1.429
24206 NaN Yes B.1.525
24207 NaN Yes B.1.526
24208 NaN Yes B.1.616
24209 NaN Yes B.1.617
24210 NaN Yes B.1.617.1
24211 NaN Yes B.1.617.2
24212 NaN Yes B.1.617.3
24213 NaN Yes B.1.620
24214 NaN Yes B.1.621
24215 NaN Yes C.37
24216 NaN Yes P.1
24217 NaN Yes P.3
24218 NaN Yes Other
24629 NaN Yes B.1.1.7
24630 NaN Yes B.1.351
24631 NaN Yes B.1.525
24632 NaN Yes P.1
24633 NaN Yes UNK
24634 NaN Yes Other
number_detections_variant percent_variant
24202 703 58.8
24203 0 0.0
24204 10 0.8
24205 1 0.1
24206 2 0.2
24207 0 0.0
24208 0 0.0
24209 0 0.0
24210 0 0.0
24211 0 0.0
24212 0 0.0
24213 0 0.0
24214 0 0.0
24215 0 0.0
24216 2 0.2
24217 0 0.0
24218 477 39.9
24629 682 68.3
24630 4 0.4
24631 1 0.1
24632 2 0.2
24633 65 6.5
24634 244 24.4
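Rather than scanning the rows by eye, the missing values can also be summarized per country and week. The following is a minimal sketch on a toy frame: the values are made up for illustration and only the column names follow the question.

```python
import numpy as np
import pandas as pd

# Toy data: made-up values, column names borrowed from the question
toy = pd.DataFrame({
    "country": ["Spain", "Spain", "Spain", "Austria"],
    "year_week": ["2021-07", "2021-08", "2021-08", "2021-08"],
    "new_cases": [67685.0, np.nan, np.nan, 5152.0],
})

# Count missing new_cases per (country, year_week)
nan_counts = (
    toy["new_cases"].isna()
    .groupby([toy["country"], toy["year_week"]])
    .sum()
)

# Keep only the groups that actually contain NaN
nan_counts = nan_counts[nan_counts > 0]
print(nan_counts)
```

On the real data this should point at a single group (Spain, 2021-08), which matches the rows listed above.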
To fill those gaps, you need to do the following:
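# Average the forward- and backward-filled copies per country and week (numeric columns only)
df2 = pd.concat([df.ffill(), df.bfill()]).groupby(['country', 'year_week']).mean(numeric_only=True).reset_index()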
For the country whose data needed filling, this gives:
country year_week new_cases number_sequenced percent_cases_sequenced \
1176 Spain 2020-40 65146.0 224.500000 0.340000
1177 Spain 2020-41 75556.0 216.800000 0.270000
1178 Spain 2020-42 85481.0 310.000000 0.323810
1179 Spain 2020-43 123871.0 195.736842 0.178947
1180 Spain 2020-44 142377.0 134.250000 0.085000
1181 Spain 2020-45 140521.0 171.200000 0.085000
1182 Spain 2020-46 115646.0 176.210526 0.178947
1183 Spain 2020-47 85752.0 170.700000 0.170000
1184 Spain 2020-48 65571.0 125.250000 0.185000
1185 Spain 2020-49 54141.0 295.105263 0.568421
1186 Spain 2020-50 49556.0 390.900000 0.755000
1187 Spain 2020-51 67365.0 485.300000 0.755000
1188 Spain 2020-52 60164.0 383.523810 0.638095
1189 Spain 2020-53 79431.0 629.150000 0.770000
1190 Spain 2021-01 152938.0 601.250000 0.400000
1191 Spain 2021-02 224669.0 966.095238 0.400000
1192 Spain 2021-03 256931.0 1400.714286 0.561905
1193 Spain 2021-04 229423.0 1358.826087 0.547826
1194 Spain 2021-05 166280.0 1380.952381 0.842857
1195 Spain 2021-06 97201.0 1136.142857 1.142857
1196 Spain 2021-07 67685.0 1259.333333 1.866667
1197 Spain 2021-08 51494.0 1143.608696 2.617391
1198 Spain 2021-09 35303.0 1226.318182 3.440909
1199 Spain 2021-10 34092.0 1236.125000 3.612500
1200 Spain 2021-11 33741.0 1145.000000 3.404167
1201 Spain 2021-12 42022.0 1450.208333 3.445833
1202 Spain 2021-13 40500.0 1408.583333 3.475000
1203 Spain 2021-14 58931.0 1326.307692 2.234615
1204 Spain 2021-15 58098.0 1540.391304 2.673913
1205 Spain 2021-16 60115.0 1168.538462 1.938462
1206 Spain 2021-17 51961.0 1286.925926 2.440741
1207 Spain 2021-18 40962.0 1326.555556 3.229630
1208 Spain 2021-19 34468.0 1184.653846 3.457692
1209 Spain 2021-20 31660.0 1053.846154 3.319231
1210 Spain 2021-21 30870.0 1095.269231 3.550000
1211 Spain 2021-22 29133.0 1071.730769 3.650000
1212 Spain 2021-23 34244.0 852.923077 2.476923
1213 Spain 2021-24 22884.0 727.461538 3.176923
1214 Spain 2021-25 27991.0 637.200000 2.284000
1215 Spain 2021-26 73833.0 475.384615 0.607692
1216 Spain 2021-27 104649.0 42.695652 0.026087
1217 Spain 2021-28 190726.0 0.300000 0.000000
number_detections_variant percent_variant
1176 14.400000 9.995000
1177 15.800000 10.000000
1178 19.000000 9.528571
1179 12.210526 10.526316
1180 9.050000 10.005000
1181 11.800000 9.995000
1182 10.736842 10.526316
1183 10.700000 9.995000
1184 8.850000 9.995000
1185 24.000000 10.526316
1186 33.000000 10.000000
1187 43.000000 10.000000
1188 39.238095 9.528571
1189 55.950000 10.000000
1190 58.550000 9.995000
1191 85.523810 9.523810
1192 117.571429 9.519048
1193 108.739130 8.700000
1194 117.428571 9.519048
1195 99.095238 9.528571
1196 105.904762 9.523810
1197 95.347826 8.691304
1198 104.863636 9.100000
1199 101.291667 8.341667
1200 99.583333 8.337500
1201 124.375000 8.329167
1202 122.000000 8.333333
1203 102.923077 7.696154
1204 133.739130 8.700000
1205 92.538462 7.696154
1206 91.851852 7.407407
1207 94.000000 7.411111
1208 88.346154 7.688462
1209 86.461538 7.692308
1210 88.807692 7.688462
1211 86.807692 7.684615
1212 68.615385 7.684615
1213 57.923077 7.692308
1214 55.800000 7.996000
1215 41.846154 7.688462
1216 4.565217 8.691304
1217 0.100000 6.445000
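To see why this produces the neighbor average, here is a small sketch that reduces the data to three Spanish weeks. The two outer new_cases values are taken from the table above; the reduced frame itself is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Three weeks for one country; the middle week is missing
toy = pd.DataFrame({
    "country": ["Spain"] * 3,
    "year_week": ["2021-07", "2021-08", "2021-09"],
    "new_cases": [67685.0, np.nan, 35303.0],
})

# ffill copies the previous week's value into the gap, bfill the next week's;
# stacking both copies and averaging per group yields the midpoint of the two
stacked = pd.concat([toy.ffill(), toy.bfill()])
result = (
    stacked.groupby(["country", "year_week"])["new_cases"]
    .mean()
    .reset_index()
)
print(result)
```

The filled value for 2021-08 is (67685 + 35303) / 2 = 51494, the same number shown for that week in the table. Weeks that were not missing are unchanged, because both copies hold the same value there.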
After that, you can round the values:
df2['new_cases'] = df2['new_cases'].round(0)
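One thing to watch with a plain ffill/bfill is that it works on row order alone, so a gap at the start or end of a country's block could pick up a value from the neighboring country. A possible variation, not the method used above, is to interpolate within each country and then write the weekly value back to every row. The sketch below uses a made-up toy frame and assumes new_cases is constant within a (country, year_week) group, as it appears to be in the printed data.

```python
import numpy as np
import pandas as pd

# Toy data: Austria values are invented, Spain values follow the table above
toy = pd.DataFrame({
    "country": ["Austria", "Austria", "Spain", "Spain", "Spain", "Spain"],
    "year_week": ["2021-08", "2021-09", "2021-07", "2021-08", "2021-08", "2021-09"],
    "new_cases": [5000.0, 5200.0, 67685.0, np.nan, np.nan, 35303.0],
})

# One value per (country, week); first() skips NaN, so all-NaN weeks stay NaN
weekly = (
    toy.groupby(["country", "year_week"], as_index=False)["new_cases"]
    .first()
    .sort_values(["country", "year_week"])
)

# Interpolate linearly inside each country only, then round
weekly["new_cases"] = (
    weekly.groupby("country")["new_cases"]
    .transform(lambda s: s.interpolate(limit_area="inside"))
    .round(0)
)

# Write the weekly values back to every original row
filled = toy.drop(columns="new_cases").merge(
    weekly, on=["country", "year_week"], how="left"
)
print(filled)
```

For a single missing week, linear interpolation gives the same result as the neighbor average (51494 here). For several consecutive missing weeks the two differ: the ffill/bfill average assigns every week in the gap the same value, while interpolation spreads the values evenly between the two known weeks.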