How to use the split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

Question

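I am trying to normalize experimental data in a pandas DataFrame that contains multiple columns with numerical observables (features), columns with the date and the experimental condition, and further non-numerical columns such as filenames.

I would like to:

use the split-apply-combine pattern
normalize within groups, using summary statistics of subgroups
use different normalizations (e.g. divide by the control mean, Z-scores)
apply this to all numerical columns (observables)
finally, generate an augmented data table that has the same structure as the original but with additional columns, e.g. for a column Observable1 a column normalized_Observable1 should be added

A simplified data table with this structure can be generated with this code snippet: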

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def', 'ctlr', 'ctlr', 'asdasd', 'afff', 'afff', 'gr1','gr2', 'gr2', 'ctrl', 'ctrl', 'kjkj','asht','ctrl'],
   'date':  ['20170131', '20170131', '20170131', '20170131','20170131', '20170606', '20170606', '20170606', '20170606', '20170606', '20170404', '20170404', '20170404', '20170404', '20170404', '20161212', '20161212', '20161212'],
   'observation1':  [1.2, 2.2, 1.3, 1.1, 2.3 , 2.3, 4.2, 3.3, 5.1, 3.3, 3.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
   'observation2':  [3.1, 2.2, 2.1, 1.2,  2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
   'observation3':  [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
   'rawsource': ["1.tif", "2.tif", "3.tif",  "4.tif", "5.tif","6.tif", "7.tif", "8.tif", "9.tif", "10.tif", "11.tif", "12.tif", "13.tif", "14.tif", "15.tif", "16.tif", "17.tif", "18.tif"]
})
print(df)

which looks like this:

   condition      date  observation1  observation2  observation3 rawsource
0       ctrl  20170131           1.2          3.10          2.00     1.tif
1        abc  20170131           2.2          2.20          1.20     2.tif
2       ctrl  20170131           1.3          2.10          1.20     3.tif
3        abc  20170131           1.1          1.20          2.01     4.tif
4        def  20170131           2.3          2.40          2.55     5.tif
5       ctlr  20170606           2.3          1.20          2.05     6.tif
6       ctlr  20170606           4.2          1.50          1.66     7.tif
7     asdasd  20170606           3.3          1.33          3.20     8.tif
8       afff  20170606           5.1          1.50          3.21     9.tif
9       afff  20170606           3.3          1.60          3.04    10.tif
10       gr1  20170404           3.4          1.40          8.01    11.tif
11       gr2  20170404           5.5          1.30          9.10    12.tif
12       gr2  20170404           9.9          0.90          7.06    13.tif
13      ctrl  20170404           3.2          0.78          8.10    14.tif
14      ctrl  20170404           1.1          1.20          7.90    15.tif
15      kjkj  20161212           3.3          4.00          5.12    16.tif
16      asht  20161212           1.2          5.00          5.23    17.tif
17      ctrl  20161212           5.4          6.00          5.15    18.tif

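Now, for each experiment date I have different experimental conditions, but I always have a condition named ctrl. One of the normalizations I would like to perform is to calculate (for every numerical column) the mean of the control experiment for that date, and then divide all observables from that date by their corresponding mean.

I can quickly calculate some per-date, per-condition summary statistics using: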

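# Restrict to the numeric columns; pandas "mean" and "std" skip NaN by default
grsummary = df.groupby(["date", "condition"])[["observation1", "observation2", "observation3"]].agg(["min", "max", "mean", "std"])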

Then I would like to apply these summary statistics in a normalization for each experiment date:

grdate = df.groupby("date")

and apply the normalization in a way like this:

def normalize_by_ctrlmean(grp_frame, summarystats):
    #  the following is only pseudo-code as I don't know how to do this
    grp_frame/ summarystats(nanmean)

grdate.apply(normalize_by_ctrlmean, summarystats= grsummary)

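The last step is only pseudo-code, and this is what I am struggling with. I could do the normalization with nested for-loops over dates, conditions, and column names of the numerical columns, but I am new to the split-apply-combine paradigm and I think there must be a simple solution. Any help is greatly appreciated.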

Answer 1

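I'm a little confused by what you want your function to do. I don't have enough reputation to comment, so I will give my best guess to try to answer your question.

Seeing that your function is called normalize_by_ctrlmean, I assume you always want to divide each observation by the mean of the ctrl group for the same date. To do this, we have to tidy your data a bit using the melt function.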

df1 = df.melt(id_vars = ['condition',
                         'date',
                         'rawsource'],
              value_vars = ['observation1',
                            'observation2',
                            'observation3'],
              var_name = 'observations')

df1.head()

Next we calculate the mean of the ctrl group for each date and observation:

# Average only the numeric 'value' column of the ctrl rows
ctrl_mean = df1[df1.condition == 'ctrl'].groupby(['date',
                                                  'observations'])['value'].mean().reset_index().rename(columns = {'value' : 'ctrl_mean'})

ctrl_mean

Merge this data frame with the melted data frame.

df2 = df1.merge(ctrl_mean,
                how = 'inner',
                on = ['date',
                      'observations'])

df2.head()

Finally, divide the value column by the ctrl_mean column and insert the result into the data frame.

df2.insert(df2.shape[1],
           'normalize_by_ctrlmean',
           df2.loc[:, 'value'] / df2.loc[:, 'ctrl_mean'])

df2.head()

Hopefully this gets you closer to what you need.
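For comparison, the same division by the per-date ctrl mean can probably also be done without melting, keeping the data in wide format. The following is only a minimal sketch on a small made-up frame (not the full data from the question); obs_cols, aligned and the normalized_ prefix are names chosen here for illustration.

```python
import pandas as pd

# Small made-up frame with the same layout as the question's data
df = pd.DataFrame({
    "condition": ["ctrl", "abc", "ctrl", "ctrl", "xyz"],
    "date": ["20170131", "20170131", "20170131", "20170404", "20170404"],
    "observation1": [1.2, 2.2, 1.3, 3.2, 1.6],
    "observation2": [3.1, 2.2, 2.1, 0.8, 1.2],
    "rawsource": ["1.tif", "2.tif", "3.tif", "4.tif", "5.tif"],
})

obs_cols = ["observation1", "observation2"]  # assumed list of observable columns

# Mean of the ctrl rows, one row per date
ctrl_means = df[df["condition"] == "ctrl"].groupby("date")[obs_cols].mean()

# Look up the ctrl means for the date of every row (NaN if a date has no ctrl row)
aligned = ctrl_means.reindex(df["date"]).to_numpy()

# Element-wise division, then add the new columns next to the originals
normalized = (df[obs_cols] / aligned).add_prefix("normalized_")
result = pd.concat([df, normalized], axis=1)
print(result)
```

The result keeps the original rows and columns and only appends normalized_observation1 and normalized_observation2, so no pivot back is needed.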

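Edit

Based on your comment, I'll show how to get back to a data frame similar to yours, with one column per observation, first using the pivot_table function and then using the groupby function.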

pivot_table

(df2.pivot_table(index = ['date', 'condition', 'rawsource'], # columns to use as the index
                 columns = 'observations', # this will make columns out of the values in this column
                 values = ['value', 'ctrl_mean', 'normalize_by_ctrlmean']) # these will be the values in each column
    .swaplevel(axis = 1)   # swaplevel swaps the column levels (axis = 1)
    .sort_index(axis = 1)  # sort_index sorts and "smooshes" them together
    .reset_index())        # reset_index so you can refer to specific columns

groupby

(df2.groupby(['date', 'condition', 'rawsource', 'observations']) # groupby these columns to make the index
    .agg({'value' : 'max', # there is only one value per group, so max just returns that value
          'ctrl_mean' : 'max',
          'normalize_by_ctrlmean' : 'max'})
    .unstack('observations') # unstack('observations') makes columns out of the 'observations'
    .swaplevel(axis = 1)     # these do the same thing as in the pivot_table example
    .sort_index(axis = 1)
    .reset_index())

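Additionally, you can remove the swaplevel and sort_index calls to keep the aggregate columns (value, ctrl_mean, normalize_by_ctrlmean) at the top level instead of the observations.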
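If flat column names in the style of normalized_observation1 (as asked for in the question) are preferred over two-level columns, the two levels can be joined after pivoting. This is a minimal sketch on a tiny hand-made long-format frame standing in for df2; long_df, wide and the naming rule are assumptions made for illustration.

```python
import pandas as pd

# Tiny hand-made stand-in for the melted and merged frame (df2 in this answer)
long_df = pd.DataFrame({
    "date": ["20170131"] * 4,
    "rawsource": ["1.tif", "1.tif", "2.tif", "2.tif"],
    "observations": ["observation1", "observation2", "observation1", "observation2"],
    "value": [1.2, 3.1, 2.2, 2.2],
    "ctrl_mean": [1.25, 2.6, 1.25, 2.6],
})
long_df["normalize_by_ctrlmean"] = long_df["value"] / long_df["ctrl_mean"]

# Pivot back to one row per file; the columns become (statistic, observation) pairs
wide = long_df.pivot_table(index=["date", "rawsource"],
                           columns="observations",
                           values=["value", "normalize_by_ctrlmean"])

# Join the two column levels into flat names:
# raw values keep the observation name, normalized values get a prefix
wide.columns = [obs if stat == "value" else f"normalized_{obs}"
                for stat, obs in wide.columns]
wide = wide.reset_index()
print(wide)
```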

Answer 2

Here is how you can do this using groupby().apply():

Split

Since you want to perform the operation 'per date', you only need to split by date:

# group_keys=False keeps the original row index in the result of apply
grdate = df.groupby("date", group_keys=False)

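Apply and combine

Next, define a transformation function that can be applied to each group and takes the group itself as its argument.

In your case, this function should compute the mean of the group's ctrl values and then divide all of the group's observations by this mean: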

def norm_apply(group):

    # Select the 'ctrl' condition
    ctrl_selected = group[group['condition']=='ctrl']

    # Extract its numerical values
    ctrl_numeric = ctrl_selected.select_dtypes(include=[np.number])

    # Compute the means (column-wise)
    ctrl_means = np.nanmean(ctrl_numeric,axis=0) 

    # Extract numerical values for all conditions
    group_numeric = group.select_dtypes(include=[np.number])

    # Divide by the ctrl means
    divided = group_numeric / ctrl_means

    # Return result
    return divided

(If you like, you can also write this as a silly one-liner...)

norm_apply = lambda x : x.select_dtypes(include=[np.number]) / np.nanmean(x[x['condition']=='ctrl'].select_dtypes(include=[np.number]),axis=0)

Now you can simply apply this function to your grouped data frame:

normed = grdate.apply(norm_apply)

This should give you the values you want, combined into the same shape/order as your original df:

normed.head()

>>   observation1  observation2  observation3
0          0.96      1.192308       1.25000
1          1.76      0.846154       0.75000
2          1.04      0.807692       0.75000
3          0.88      0.461538       1.25625
4          1.84      0.923077       1.59375
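The question also mentions Z-scores. The same split-apply-combine idea can probably be reused for that by swapping the per-group function. The sketch below assumes the Z-score is meant relative to all rows of the same date (not only the ctrl rows) and uses the sample standard deviation; the small frame, obs_cols, zscore and the zscore_ prefix are made up for illustration. It uses groupby(...).transform, which returns a result aligned with the original rows.

```python
import pandas as pd

# Small made-up frame with the same layout as the question's data
df = pd.DataFrame({
    "condition": ["ctrl", "abc", "def", "ctrl", "xyz"],
    "date": ["20170131", "20170131", "20170131", "20170404", "20170404"],
    "observation1": [1.0, 2.0, 3.0, 2.0, 4.0],
    "observation2": [10.0, 20.0, 30.0, 5.0, 7.0],
})

obs_cols = ["observation1", "observation2"]  # assumed list of observable columns

def zscore(col):
    # col is one numeric column of one date group;
    # pandas std() uses the sample standard deviation (ddof=1)
    return (col - col.mean()) / col.std()

# Split by date, apply zscore to every observable column, combine in original row order
zscored = df.groupby("date")[obs_cols].transform(zscore).add_prefix("zscore_")
result = pd.concat([df, zscored], axis=1)
print(result)
```

Note that a date with only a single row would give NaN here, because the sample standard deviation of one value is undefined.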

Merge into the original DataFrame

One way to add these results back to the original df is as follows:

# Add prefix to column names
normed = normed.add_prefix('normed_')

# Concatenate with initial data frame
final = pd.concat([df,normed],axis=1)
display(final.head())

Finally, you can group by date and condition and look at the means:

final.groupby(['date','condition']).mean(numeric_only=True)

If everything works, the means of the normed columns for the ctrl condition should all be 1.0.
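One caveat with the sample data from the question: the rows for 20170606 are labelled ctlr rather than ctrl, so that date has no control rows. With the function above its normalized values would presumably come out as NaN, and an inner merge on the ctrl means would drop those rows. A quick sanity check before normalizing could look like the sketch below; the small frame and the names has_ctrl and dates_without_ctrl are made up for illustration.

```python
import pandas as pd

# Small made-up frame; one date uses the misspelled label 'ctlr'
df = pd.DataFrame({
    "condition": ["ctrl", "abc", "ctlr", "afff", "ctrl"],
    "date": ["20170131", "20170131", "20170606", "20170606", "20170404"],
})

# For each date: is there at least one row whose condition is exactly 'ctrl'?
has_ctrl = df["condition"].eq("ctrl").groupby(df["date"]).any()

# Dates that would end up without a control mean
dates_without_ctrl = has_ctrl[~has_ctrl].index.tolist()
print(dates_without_ctrl)  # ['20170606']
```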

(Side note: while Ian Thompson's answer also works, I believe this approach sticks more closely to the split-apply-combine idea.)
