使用 pandas groupby() 添加方程的计算列
Adding a calculated column of an equation using pandas groupby()
我刚开始使用 pandas,我正在尝试根据 Spotfire 计算列公式
添加一个带有 groupby() 的列
假设我有一个 table 具有以下数据 (df1):
'Well ID','Assay','Source','Treat','BkgrdSub Fluorescence','Calced'
'A1',4,'Source 1','OPA',-215.75,0.035583351
'A2',4,'Source 1','OPA',-160.75,0.130472288
'A3',4,'Source 1','OPA',343.25,1
'H10',6,'Source 1','OPP',9896,1
'H11',6,'Source 1','OPP',9892,0.999605226
'H12',6,'Source 1','CN',-1,1
'A1',4,'Source 2','OPA',-170,0.03682641
'A2',4,'Source 2','OPA',-86,0.083431583
'A3',4,'Source 2','OPA',1566,1
'H10',6,'Source 2','ZI',4885,0.809271732
'H11',6,'Source 2','ZI',6092,1
'H12',6,'Source 2','CN',78,1
'A1',4,'Source 3','OPA',-114.5,0.037329147
'A2',4,'Source 3','OPA',-114.5,0.037329147
'A3',4,'Source 3','OPA',3028.5,1
'H10',6,'Source 3','ZIII',4245.375,0.85305734
'H11',6,'Source 3','ZIII',5017.375,1
'H12',6,'Source 3','CN',20.375,1
'A1',4,'Source 4','OPA',-183.375,0.017731683
'A2',4,'Source 4','OPA',-102.375,0.044831047
'A3',4,'Source 4','OPA',2752.625,1
'H10',6,'Source 4','ZIIII',2635.75,0.697943562
'H11',6,'Source 4','ZIIII',3878.75,1
'H12',6,'Source 4','CN',-10.25,1
'A1',4,'Source 5','OPA',-236.375,0
'A2',4,'Source 5','OPA',-199.375,0.028094153
'A3',4,'Source 5','OPA',1080.625,1
'H10',6,'Source 5','ZV',3489,0.952202946
'H11',6,'Source 5','ZV',3676,1
'H12',6,'Source 5','CN',31,1
'A1',4,'Source 6','OPA',-221.375,0.008870491
'A2',4,'Source 6','OPA',-150.375,0.050857481
'A3',4,'Source 6','OPA',1454.625,1
'H10',6,'Source 6','ZVI',2224.375,1
'H11',6,'Source 6','ZVI',1418.375,0.672457584
'H12',6,'Source 6','CN',716.375,1
我希望能够添加一个由 Spotfire 方程定义的计算列:
([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) / Max([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) OVER ([Treat],[Source],[Assay])
我一步一步地构建脚本,我已经到了尝试使用 groupby() 运行 的地步:
import pandas as pd
df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1[['BkgrdSub Fluorescence'] - ['Scaled BckgrdSub Fluorescence min']].groupby(df1['Treat'],df1['Source'],df1['Assay']).transform('max')
但是我得到错误:
TypeError: unsupported operand type(s) for -: 'list' and 'list'
我的理解是我不能从列表中减去列表。很明显,语法不支持 groupby() 函数中的方程式。
我还试图通过避免使用 groupby() 来避免此语法错误,其中 'Scaled BckgrdSub Fluorescence' 是所需的结果列:
df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1['BkgrdSub Fluorescence'] - df1['Scaled BckgrdSub Fluorescence min']
df1.insert(7,"Scaled BckgrdSub Fluorescence max","")
df1['Scaled BckgrdSub Fluorescence max'] = df1.groupby(['Treat','Source','Assay'])['Scaled BckgrdSub Fluorescence eq'].transform('max')
df1.insert(8,"Scaled BckgrdSub Fluorescence","")
df1['Scaled BckgrdSub Fluorescence'] = df1['Scaled BckgrdSub Fluorescence eq'] / df1['Scaled BckgrdSub Fluorescence max']
但是,这与您在 Spotfire 中获得的计算列的结果不同。
计算列的预期输出(取自 Spotfire)已显示在 'Calced' 列中。
所以我的问题是,有没有一种简单的方法可以在几行内使用 groupby() 函数添加我想要的列,同时保持准确?
这是复制目标号码的一种方法:
df1.groupby('Source').apply(lambda x: ((x['BkgrdSub Fluorescence'] - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() - df1['BkgrdSub Fluorescence'].min())))
结果:
Source1 0 0.035583
1 0.130472
2 1.000000
Source2 3 0.036826
4 0.083432
5 1.000000
Source3 6 0.037329
7 0.037329
8 1.000000
Source4 9 0.017732
10 0.044831
11 1.000000
Source5 12 0.000000
13 0.028094
14 1.000000
Source6 15 0.008870
16 0.050857
17 1.000000
更新了新数据集:
df1.index.name = 'ID' # will be used for sorting
# group and apply equation
grouped = df1.groupby(['Treat','Source','Assay']).apply(lambda x: ((x['BkgrdSub Fluorescence']
- df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max()
- df1['BkgrdSub Fluorescence'].min())))
# remove multi-index from grouped, sort on ID, and insert into df1
df1['Calculated'] = grouped.sort_index(level='ID').reset_index(drop=True)
新结果(参见计算列):
Well ID Assay Source Treat BkgrdSub Fluorescence Calced Calculated
ID
0 A1 4 Source 1 OPA -215.750 0.035583 0.035583
1 A2 4 Source 1 OPA -160.750 0.130472 0.130472
2 A3 4 Source 1 OPA 343.250 1.000000 1.000000
3 H10 6 Source 1 OPP 9896.000 1.000000 1.000000
4 H11 6 Source 1 OPP 9892.000 0.999605 0.999605
5 H12 6 Source 1 CN -1.000 1.000000 1.000000
6 A1 4 Source 2 OPA -170.000 0.036826 0.036826
7 A2 4 Source 2 OPA -86.000 0.083432 0.083432
8 A3 4 Source 2 OPA 1566.000 1.000000 1.000000
9 H10 6 Source 2 ZI 4885.000 0.809272 0.809272
10 H11 6 Source 2 ZI 6092.000 1.000000 1.000000
11 H12 6 Source 2 CN 78.000 1.000000 1.000000
12 A1 4 Source 3 OPA -114.500 0.037329 0.037329
13 A2 4 Source 3 OPA -114.500 0.037329 0.037329
14 A3 4 Source 3 OPA 3028.500 1.000000 1.000000
15 H10 6 Source 3 ZIII 4245.375 0.853057 0.853057
16 H11 6 Source 3 ZIII 5017.375 1.000000 1.000000
17 H12 6 Source 3 CN 20.375 1.000000 1.000000
18 A1 4 Source 4 OPA -183.375 0.017732 0.017732
19 A2 4 Source 4 OPA -102.375 0.044831 0.044831
20 A3 4 Source 4 OPA 2752.625 1.000000 1.000000
21 H10 6 Source 4 ZIIII 2635.750 0.697944 0.697944
22 H11 6 Source 4 ZIIII 3878.750 1.000000 1.000000
23 H12 6 Source 4 CN -10.250 1.000000 1.000000
24 A1 4 Source 5 OPA -236.375 0.000000 0.000000
25 A2 4 Source 5 OPA -199.375 0.028094 0.028094
26 A3 4 Source 5 OPA 1080.625 1.000000 1.000000
27 H10 6 Source 5 ZV 3489.000 0.952203 0.952203
28 H11 6 Source 5 ZV 3676.000 1.000000 1.000000
29 H12 6 Source 5 CN 31.000 1.000000 1.000000
30 A1 4 Source 6 OPA -221.375 0.008870 0.008870
31 A2 4 Source 6 OPA -150.375 0.050857 0.050857
32 A3 4 Source 6 OPA 1454.625 1.000000 1.000000
33 H10 6 Source 6 ZVI 2224.375 1.000000 1.000000
34 H11 6 Source 6 ZVI 1418.375 0.672458 0.672458
35 H12 6 Source 6 CN 716.375 1.000000 1.000000
我刚开始使用 pandas,我正在尝试根据 Spotfire 计算列公式
添加一个带有 groupby() 的列假设我有一个 table 具有以下数据 (df1):
'Well ID','Assay','Source','Treat','BkgrdSub Fluorescence','Calced'
'A1',4,'Source 1','OPA',-215.75,0.035583351
'A2',4,'Source 1','OPA',-160.75,0.130472288
'A3',4,'Source 1','OPA',343.25,1
'H10',6,'Source 1','OPP',9896,1
'H11',6,'Source 1','OPP',9892,0.999605226
'H12',6,'Source 1','CN',-1,1
'A1',4,'Source 2','OPA',-170,0.03682641
'A2',4,'Source 2','OPA',-86,0.083431583
'A3',4,'Source 2','OPA',1566,1
'H10',6,'Source 2','ZI',4885,0.809271732
'H11',6,'Source 2','ZI',6092,1
'H12',6,'Source 2','CN',78,1
'A1',4,'Source 3','OPA',-114.5,0.037329147
'A2',4,'Source 3','OPA',-114.5,0.037329147
'A3',4,'Source 3','OPA',3028.5,1
'H10',6,'Source 3','ZIII',4245.375,0.85305734
'H11',6,'Source 3','ZIII',5017.375,1
'H12',6,'Source 3','CN',20.375,1
'A1',4,'Source 4','OPA',-183.375,0.017731683
'A2',4,'Source 4','OPA',-102.375,0.044831047
'A3',4,'Source 4','OPA',2752.625,1
'H10',6,'Source 4','ZIIII',2635.75,0.697943562
'H11',6,'Source 4','ZIIII',3878.75,1
'H12',6,'Source 4','CN',-10.25,1
'A1',4,'Source 5','OPA',-236.375,0
'A2',4,'Source 5','OPA',-199.375,0.028094153
'A3',4,'Source 5','OPA',1080.625,1
'H10',6,'Source 5','ZV',3489,0.952202946
'H11',6,'Source 5','ZV',3676,1
'H12',6,'Source 5','CN',31,1
'A1',4,'Source 6','OPA',-221.375,0.008870491
'A2',4,'Source 6','OPA',-150.375,0.050857481
'A3',4,'Source 6','OPA',1454.625,1
'H10',6,'Source 6','ZVI',2224.375,1
'H11',6,'Source 6','ZVI',1418.375,0.672457584
'H12',6,'Source 6','CN',716.375,1
我希望能够添加一个由 Spotfire 方程定义的计算列:
([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) / Max([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) OVER ([Treat],[Source],[Assay])
我一步一步地构建脚本,我已经到了尝试使用 groupby() 运行 的地步:
import pandas as pd
df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1[['BkgrdSub Fluorescence'] - ['Scaled BckgrdSub Fluorescence min']].groupby(df1['Treat'],df1['Source'],df1['Assay']).transform('max')
但是我得到错误:
TypeError: unsupported operand type(s) for -: 'list' and 'list'
我的理解是我不能从列表中减去列表。很明显,语法不支持 groupby() 函数中的方程式。
我还试图通过避免使用 groupby() 来避免此语法错误,其中 'Scaled BckgrdSub Fluorescence' 是所需的结果列:
df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1['BkgrdSub Fluorescence'] - df1['Scaled BckgrdSub Fluorescence min']
df1.insert(7,"Scaled BckgrdSub Fluorescence max","")
df1['Scaled BckgrdSub Fluorescence max'] = df1.groupby(['Treat','Source','Assay'])['Scaled BckgrdSub Fluorescence eq'].transform('max')
df1.insert(8,"Scaled BckgrdSub Fluorescence","")
df1['Scaled BckgrdSub Fluorescence'] = df1['Scaled BckgrdSub Fluorescence eq'] / df1['Scaled BckgrdSub Fluorescence max']
但是,这与您在 Spotfire 中获得的计算列的结果不同。
计算列的预期输出(取自 Spotfire)已显示在 'Calced' 列中。
所以我的问题是,有没有一种简单的方法可以在几行内使用 groupby() 函数添加我想要的列,同时保持准确?
这是复制目标号码的一种方法:
df1.groupby('Source').apply(lambda x: ((x['BkgrdSub Fluorescence'] - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() - df1['BkgrdSub Fluorescence'].min())))
结果:
Source1 0 0.035583
1 0.130472
2 1.000000
Source2 3 0.036826
4 0.083432
5 1.000000
Source3 6 0.037329
7 0.037329
8 1.000000
Source4 9 0.017732
10 0.044831
11 1.000000
Source5 12 0.000000
13 0.028094
14 1.000000
Source6 15 0.008870
16 0.050857
17 1.000000
更新了新数据集:
df1.index.name = 'ID' # will be used for sorting
# group and apply equation
grouped = df1.groupby(['Treat','Source','Assay']).apply(lambda x: ((x['BkgrdSub Fluorescence']
- df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max()
- df1['BkgrdSub Fluorescence'].min())))
# remove multi-index from grouped, sort on ID, and insert into df1
df1['Calculated'] = grouped.sort_index(level='ID').reset_index(drop=True)
新结果(参见计算列):
Well ID Assay Source Treat BkgrdSub Fluorescence Calced Calculated
ID
0 A1 4 Source 1 OPA -215.750 0.035583 0.035583
1 A2 4 Source 1 OPA -160.750 0.130472 0.130472
2 A3 4 Source 1 OPA 343.250 1.000000 1.000000
3 H10 6 Source 1 OPP 9896.000 1.000000 1.000000
4 H11 6 Source 1 OPP 9892.000 0.999605 0.999605
5 H12 6 Source 1 CN -1.000 1.000000 1.000000
6 A1 4 Source 2 OPA -170.000 0.036826 0.036826
7 A2 4 Source 2 OPA -86.000 0.083432 0.083432
8 A3 4 Source 2 OPA 1566.000 1.000000 1.000000
9 H10 6 Source 2 ZI 4885.000 0.809272 0.809272
10 H11 6 Source 2 ZI 6092.000 1.000000 1.000000
11 H12 6 Source 2 CN 78.000 1.000000 1.000000
12 A1 4 Source 3 OPA -114.500 0.037329 0.037329
13 A2 4 Source 3 OPA -114.500 0.037329 0.037329
14 A3 4 Source 3 OPA 3028.500 1.000000 1.000000
15 H10 6 Source 3 ZIII 4245.375 0.853057 0.853057
16 H11 6 Source 3 ZIII 5017.375 1.000000 1.000000
17 H12 6 Source 3 CN 20.375 1.000000 1.000000
18 A1 4 Source 4 OPA -183.375 0.017732 0.017732
19 A2 4 Source 4 OPA -102.375 0.044831 0.044831
20 A3 4 Source 4 OPA 2752.625 1.000000 1.000000
21 H10 6 Source 4 ZIIII 2635.750 0.697944 0.697944
22 H11 6 Source 4 ZIIII 3878.750 1.000000 1.000000
23 H12 6 Source 4 CN -10.250 1.000000 1.000000
24 A1 4 Source 5 OPA -236.375 0.000000 0.000000
25 A2 4 Source 5 OPA -199.375 0.028094 0.028094
26 A3 4 Source 5 OPA 1080.625 1.000000 1.000000
27 H10 6 Source 5 ZV 3489.000 0.952203 0.952203
28 H11 6 Source 5 ZV 3676.000 1.000000 1.000000
29 H12 6 Source 5 CN 31.000 1.000000 1.000000
30 A1 4 Source 6 OPA -221.375 0.008870 0.008870
31 A2 4 Source 6 OPA -150.375 0.050857 0.050857
32 A3 4 Source 6 OPA 1454.625 1.000000 1.000000
33 H10 6 Source 6 ZVI 2224.375 1.000000 1.000000
34 H11 6 Source 6 ZVI 1418.375 0.672458 0.672458
35 H12 6 Source 6 CN 716.375 1.000000 1.000000