Normalization of samples based on the controls when we have several groups

Question

Suppose we have the following DataFrame:

import numpy as np
import pandas as pd

data = {'Compounds': ['Drug_A', 'Drug_A', 'Drug_A', 'Drug_A', 'Drug_A', 'Drug_A', 'Drug_B', 'Drug_B',
                   'Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B',
                   'Drug_C', 'Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C', np.nan, 
                   np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                   np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], 
        'values': [24, 20, 48, 17, 20, 8, 22, 16, 46, 44, 12, 38, 26, 16, 19, 23, 9, 39, 19, 24, 43, 6, 24, 46, 26, 15, 8, 
                  22, 22, 32, 23, 41, 8, 46, 29, 34, 34, 39, 32, 22, 28, 34, 29, 19, 44, 22, 17, 41, 19, 39, 27, 46, 37, 26],
      'identifier': ['Sample', 'Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample',
                    'Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample',
                    'Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample', 'Control', 'Control',
                    'Control','Control','Control','Control','Control','Control','Control','Control','Control',
                    'Control','Control','Control','Control','Control','Control','Control','Control','Control',
                    'Control','Control','Control','Control','Control','Control',], 
        'Experiment': ['P1', 'P1', 'P2', 'P2', 'P3', 'P3', 'P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P3', 'P3',
                       'P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P2', 'P2', 'P3', 'P3', 'P1', 'P1', 'P1', 'P1',
                       'P1', 'P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P2', 'P2', 'P2', 'P2', 'P2', 'P2', 'P2',
                       'P2', 'P2', 'P3', 'P3', 'P3', 'P3', 'P3', 'P3', 'P1', 'P2', 'P3', 'P1']}
df = pd.DataFrame(data)

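The identifier column contains Sample and Control entries. First, we want to compute the mean of the 'values' column over all controls, pooled across the experiments (P1, P2, P3):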

df_control = df.loc[df['identifier'] == 'Control']
z = df_control['values'].mean()

Is there a more compact, one-line form of the script above? Can I use a list comprehension?

Next, for normalization, we want to divide z by the mean of the control 'values' within each experiment (P1, P2, P3), which gives a separate normalization_factor for each experiment.

Finally, multiply each sample value by the normalization factor of the experiment it belongs to.

What is the simplest and most direct way to do this? Thanks for your kind help!

Answer 1

Is this what you are looking for?

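# Select 'values' explicitly: pandas >= 2.0 raises a TypeError when asked to average the string columns.
df.groupby(by=['identifier'])[['values']].mean()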
Out: 
               values
identifier           
Control     30.384615
Sample      24.285714
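
For the one-line form the question asks about, a list comprehension is not needed: the row mask and the column label can both go into a single .loc call. Below is a minimal sketch on a small hypothetical frame (the four rows are made up for illustration; only the column names come from the question).

```python
import pandas as pd

# Hypothetical toy data reusing the question's column names.
df = pd.DataFrame({
    'values': [10, 20, 30, 50],
    'identifier': ['Sample', 'Control', 'Control', 'Sample'],
    'Experiment': ['P1', 'P1', 'P2', 'P2'],
})

# Filter the Control rows and pick the 'values' column in one step, then average.
z = df.loc[df['identifier'] == 'Control', 'values'].mean()
print(z)  # 25.0, the mean of the two control values 20 and 30
```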

Then:

# Again restricted to 'values' so that the remaining string column is not averaged.
df.groupby(by=['identifier', 'Experiment'])[['values']].mean()
Out: 
                          values
identifier Experiment           
Control    P1          28.500000
           P2          30.769231
           P3          31.285714
Sample     P1          20.833333
           P2          29.000000
           P3          23.333333

The MultiIndex of the second result can be used to access the data:

MultiIndex([('Control', 'P1'),
            ('Control', 'P2'),
            ('Control', 'P3'),
            ( 'Sample', 'P1'),
            ( 'Sample', 'P2'),
            ( 'Sample', 'P3')],
           names=['identifier', 'Experiment'])

You can now build on this:

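# Overall mean per identifier, and mean per (identifier, Experiment) pair.
all_mean = df.groupby(by=['identifier'])[['values']].mean()
spec_mean = df.groupby(by=['identifier', 'Experiment'])[['values']].mean()
# The division aligns on the shared 'identifier' level.
result = all_mean/spec_mean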

Out
                         values
identifier Experiment          
Control    P1          1.066127
           P2          0.987500
           P3          0.971198
Sample     P1          1.165714
           P2          0.837438
           P3          1.040816

Now convert the data into some kind of flat structure (the OP did not specify which):

normalization_factors = {idx[1]: result.loc[idx].values[0] for idx in result.index if idx[0] == 'Control'}
# {'P1': 1.0661268556005397, 'P2': 0.9874999999999999, 'P3': 0.9711977520196698}
sample_values = {idx[1]: result.loc[idx].values[0] * normalization_factors[idx[1]] for idx in result.index if idx[0] == 'Sample'}
# {'P1': 1.2427993059572005, 'P2': 0.8269704433497537, 'P3': 1.0108384765919014}
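
The per-experiment factors can also be pulled out without looping over the index, by taking a cross-section of the 'identifier' level. This is only a sketch on a small hypothetical frame (the six rows are invented; the column names follow the question), and it computes the control factors only.

```python
import pandas as pd

# Hypothetical toy data reusing the question's column names.
df = pd.DataFrame({
    'values': [10, 20, 30, 50, 40, 60],
    'identifier': ['Sample', 'Control', 'Control', 'Sample', 'Control', 'Control'],
    'Experiment': ['P1', 'P1', 'P2', 'P2', 'P1', 'P2'],
})

# Mean of all controls, pooled over the experiments: (20 + 30 + 40 + 60) / 4.
z = df.loc[df['identifier'] == 'Control', 'values'].mean()

# Mean per (identifier, Experiment) pair as a Series with a MultiIndex.
spec_mean = df.groupby(['identifier', 'Experiment'])['values'].mean()

# Keep only the Control rows; the result is indexed by Experiment alone.
control_means = spec_mean.xs('Control', level='identifier')

# One normalization factor per experiment, as a plain dict.
normalization_factors = (z / control_means).to_dict()
print(normalization_factors)
```

Here the control means are 30 for P1 and 45 for P2, so the factors are 37.5 / 30 and 37.5 / 45.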

Map sample_values onto df:

df["calculated_col_with_the_name_you_prefer"] = df["Experiment"].map(sample_values)
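
If the goal is what the question literally describes, namely each individual sample value multiplied by the factor of its own experiment, one possible reading is sketched below. The frame is a small hypothetical one, and the column name 'normalized' is my own choice, not something from the question.

```python
import pandas as pd

# Hypothetical toy data reusing the question's column names.
df = pd.DataFrame({
    'values': [10, 20, 30, 50, 40, 60],
    'identifier': ['Sample', 'Control', 'Control', 'Sample', 'Control', 'Control'],
    'Experiment': ['P1', 'P1', 'P2', 'P2', 'P1', 'P2'],
})

is_control = df['identifier'] == 'Control'
is_sample = df['identifier'] == 'Sample'

# Step 1: mean of all controls across experiments.
z = df.loc[is_control, 'values'].mean()

# Step 2: factor per experiment = overall control mean / control mean of that experiment.
factors = z / df[is_control].groupby('Experiment')['values'].mean()

# Step 3: scale every sample row by the factor of its experiment.
# Control rows are left as NaN in the new (assumed) column 'normalized'.
df.loc[is_sample, 'normalized'] = (
    df.loc[is_sample, 'values'] * df.loc[is_sample, 'Experiment'].map(factors)
)
print(df)
```

Mapping the Experiment column through the factors Series keeps the result row-aligned with df, so no merge or loop is required.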

python

normalization

pandas