pandas fillna by group for multiple columns

In a dataset like this one (in CSV format) there are multiple columns with values. How can I use fillna with df.groupby("DateSent") to fill all the needed columns per group?

In [5]: df.head()
Out[5]: 
  ID  DateAcquired  DateSent         data   value  measurement    values
0  1      20210518  20220110  6358.434713   556.0   317.869897  3.565781
1  1      20210719  20220210  6508.458382  1468.0   774.337509  5.565384
2  1      20210719  20220310  6508.466246     1.0    40.837533  1.278085
3  1      20200420  20220410  6507.664194    48.0    64.335047  1.604183
4  1      20210328  20220510  6508.451227     0.0    40.337486  1.270236

Following this other thread on SO, one way is to do it one column at a time:

df["data"]        = df.groupby("DateSent")["data"].transform(lambda x: x.fillna(x.min()/3))
df["value"]       = df.groupby("DateSent")["value"].transform(lambda x: x.fillna(x.min()/3))
df["measurement"] = df.groupby("DateSent")["measurement"].transform(lambda x: x.fillna(x.min()/3))
df["values"]      = df.groupby("DateSent")["values"].transform(lambda x: x.fillna(x.min()/3))

In my original dataset, where I have 100000 such columns, I could technically loop over all the needed column names. But is there a better/faster way? Maybe something already implemented in pandas?

One way you can do this is to put all the columns you want to impute into a list. I assume you want all the numerical columns (except ID, DateAcquired and DateSent):

fti = [i for i in df.iloc[:,3:].columns if df[i].dtypes != 'object'] # features to impute
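
If the position-based iloc[:,3:] slice feels brittle, a roughly equivalent selection (a sketch, assuming ID, DateAcquired and DateSent are the only columns you want to exclude) can be built from the dtypes directly:

# select all numeric columns, then drop the id/date columns by name
fti = df.select_dtypes(include="number").columns.difference(
    ["ID", "DateAcquired", "DateSent"]
).tolist()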

You can then create a new df containing only the imputed values:

imputed = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min()/3))

imputed.head(5)
          data   value  measurement    values
0  6358.434713   556.0   317.869897  3.565781
1  6508.458382  1468.0   774.337509  5.565384
2  6508.466246     1.0    40.837533  1.278085
3  6507.664194    48.0    64.335047  1.604183
4  6508.451227     0.0    40.337486  1.270236
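
Because transform returns a result aligned with the original index, these imputed columns line up row by row with df.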

Finally, you can concat the untouched columns with the imputed ones:

res = pd.concat([df[df.columns.symmetric_difference(imputed.columns)],imputed],axis=1)
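
Note that symmetric_difference returns the column names sorted, which is why ID now appears after DateAcquired and DateSent in the result.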

res.head(15)

    DateAcquired  DateSent ID         data   value  measurement    values
0       20210518  20220110  1  6358.434713   556.0   317.869897  3.565781
1       20210719  20220210  1  6508.458382  1468.0   774.337509  5.565384
2       20210719  20220310  1  6508.466246     1.0    40.837533  1.278085
3       20200420  20220410  1  6507.664194    48.0    64.335047  1.604183
4       20210328  20220510  1  6508.451227     0.0    40.337486  1.270236
5       20210518  20220610  1  6508.474031     3.0    15.000000  0.774597
6       20210108  20220110  2  6508.402472   897.0   488.837335  4.421933
7       20210110  20220210  2  6508.410493    52.0   111.000000  2.107131
8       20210119  20220310  2  6508.419065   800.0   440.337387  4.196844
9       20210108  20220410  2  6508.426063    89.0    84.837408  1.842144
10      20200109  20220510  2  6507.647600   978.0   529.334996  4.601456
11      20210919  20220610  2  6508.505563  1566.0   823.337655  5.738772
12      20211214  20220612  2  6508.528918   152.0   500.000000  4.472136
13      20210812  20220620  2  6508.497936   668.0   374.337631  3.869561
14      20210909  20220630  2  6508.506350   489.0   284.837657  3.375427
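
If you don't need res as a separate frame, a shorter variant (assuming the same fti list as above) assigns the transform result straight back, which also keeps the original column order:

# overwrite the selected columns in place; transform aligns on the index
df[fti] = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min()/3))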