Normalization of samples based on the controls when we have several groups
Suppose we have the following DataFrame:
import numpy as np
import pandas as pd

data = {'Compounds': ['Drug_A', 'Drug_A', 'Drug_A', 'Drug_A', 'Drug_A', 'Drug_A', 'Drug_B', 'Drug_B',
'Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B','Drug_B',
'Drug_C', 'Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C','Drug_C', np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'values': [24, 20, 48, 17, 20, 8, 22, 16, 46, 44, 12, 38, 26, 16, 19, 23, 9, 39, 19, 24, 43, 6, 24, 46, 26, 15, 8,
22, 22, 32, 23, 41, 8, 46, 29, 34, 34, 39, 32, 22, 28, 34, 29, 19, 44, 22, 17, 41, 19, 39, 27, 46, 37, 26],
'identifier': ['Sample', 'Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample',
'Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample',
'Sample','Sample','Sample','Sample','Sample','Sample','Sample','Sample', 'Control', 'Control',
'Control','Control','Control','Control','Control','Control','Control','Control','Control',
'Control','Control','Control','Control','Control','Control','Control','Control','Control',
'Control','Control','Control','Control','Control','Control',],
'Experiment': ['P1', 'P1', 'P2',
'P2', 'P3', 'P3', 'P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P3', 'P3', 'P1', 'P1', 'P1', 'P2', 'P2',
'P2', 'P2', 'P2', 'P3', 'P3', 'P1','P1', 'P1', 'P1', 'P1', 'P1', 'P1', 'P1',
'P2', 'P2','P2','P2','P2','P2','P2','P2','P2','P2','P2','P2','P3','P3','P3','P3','P3','P3', 'P1', 'P2',
'P3','P1' ]}
df = pd.DataFrame(data)
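The identifier column marks each row as either a Sample or a Control.
First, we want to:
compute the mean of the 'values' column over all controls from the different experiments (i.e. P1, P2, P3):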
df_control = df.loc[df['identifier'] == 'Control']
z = df_control['values'].mean()
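If I wanted to write this in one line, what would the compact form of the script above be? Can I use a list comprehension?
Next, for normalization, we want to divide z by the mean of 'values' of the controls within each experiment (P1, P2, P3), which gives one normalization_factor per experiment.
Finally, multiply the sample values belonging to each experiment by that experiment's normalization factor.
What is the simplest, most direct way to do this?
Thank you for your kind help!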
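Is this what you are looking for?
# Select 'values' first so that mean() is not applied to the string columns
df.groupby(by=['identifier'])[['values']].mean()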
Out:
               values
identifier
Control     30.384615
Sample      24.285714
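On the one-line form asked about in the question: a list comprehension is not needed, because `.loc` accepts a row mask and a column label together. The sketch below uses a small made-up frame (the numbers are illustrative, not the data above) so that it runs on its own.

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the data from the question)
df = pd.DataFrame({
    'values': [10, 20, 30, 50],
    'identifier': ['Sample', 'Control', 'Control', 'Sample'],
})

# The boolean mask selects the control rows, 'values' selects the column,
# and mean() reduces the selection to a single number.
z = df.loc[df['identifier'] == 'Control', 'values'].mean()
print(z)  # 25.0
```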
Then:
df.groupby(by=['identifier', 'Experiment'])[['values']].mean()
Out:
                          values
identifier Experiment
Control    P1          28.500000
           P2          30.769231
           P3          31.285714
Sample     P1          20.833333
           P2          29.000000
           P3          23.333333
The MultiIndex of the second result can be used to access the data:
MultiIndex([('Control', 'P1'),
            ('Control', 'P2'),
            ('Control', 'P3'),
            ( 'Sample', 'P1'),
            ( 'Sample', 'P2'),
            ( 'Sample', 'P3')],
           names=['identifier', 'Experiment'])
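As a rough illustration of what accessing data through such a MultiIndex can look like, the sketch below reads one cell with a tuple key and takes a cross-section with `xs`. The frame is a small made-up example, and the names `control_p1` and `control_by_experiment` are only for illustration.

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the data from the question)
df = pd.DataFrame({
    'values': [10, 20, 30, 40, 50, 60],
    'identifier': ['Control', 'Control', 'Control', 'Sample', 'Sample', 'Sample'],
    'Experiment': ['P1', 'P1', 'P2', 'P1', 'P2', 'P2'],
})

spec_mean = df.groupby(['identifier', 'Experiment'])[['values']].mean()

# Single cell: an (identifier, Experiment) tuple picks the row, 'values' the column
control_p1 = spec_mean.loc[('Control', 'P1'), 'values']

# Cross-section: every experiment for one identifier; the 'identifier' level is
# dropped, so the result is indexed by Experiment alone
control_by_experiment = spec_mean.xs('Control', level='identifier')['values']

print(control_p1)                       # 15.0
print(control_by_experiment.to_dict())  # {'P1': 15.0, 'P2': 30.0}
```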
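You can now build on this:
all_mean = df.groupby(by=['identifier'])[['values']].mean()
spec_mean = df.groupby(by=['identifier', 'Experiment'])[['values']].mean()
# overall mean per identifier divided by the per-experiment mean
result = all_mean/spec_mean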
Out:
                         values
identifier Experiment
Control    P1          1.066127
           P2          0.987500
           P3          0.971198
Sample     P1          1.165714
           P2          0.837438
           P3          1.040816
Now convert the data into some kind of flat structure (the OP did not specify this clearly):
normalization_factors = {idx[1]: result.loc[idx].values[0] for idx in result.index if idx[0] == 'Control'}
# {'P1': 1.0661268556005397, 'P2': 0.9874999999999999, 'P3': 0.9711977520196698}
sample_values = {idx[1]: result.loc[idx].values[0] * normalization_factors[idx[1]] for idx in result.index if idx[0] == 'Sample'}
# {'P1': 1.2427993059572005, 'P2': 0.8269704433497537, 'P3': 1.0108384765919014}
Map sample_values onto df as:
df["calculated_col_with_the_name_you_prefer"] = df["Experiment"].map(sample_values)