使用 pandas groupby() 添加方程的计算列

Question

我刚开始使用 pandas，我正在尝试根据 Spotfire 计算列公式

添加一个带有 groupby() 的列

假设我有一个 table 具有以下数据 (df1)：

'Well ID','Assay','Source','Treat','BkgrdSub Fluorescence','Calced'
'A1',4,'Source 1','OPA',-215.75,0.035583351
'A2',4,'Source 1','OPA',-160.75,0.130472288
'A3',4,'Source 1','OPA',343.25,1
'H10',6,'Source 1','OPP',9896,1
'H11',6,'Source 1','OPP',9892,0.999605226
'H12',6,'Source 1','CN',-1,1
'A1',4,'Source 2','OPA',-170,0.03682641
'A2',4,'Source 2','OPA',-86,0.083431583
'A3',4,'Source 2','OPA',1566,1
'H10',6,'Source 2','ZI',4885,0.809271732
'H11',6,'Source 2','ZI',6092,1
'H12',6,'Source 2','CN',78,1
'A1',4,'Source 3','OPA',-114.5,0.037329147
'A2',4,'Source 3','OPA',-114.5,0.037329147
'A3',4,'Source 3','OPA',3028.5,1
'H10',6,'Source 3','ZIII',4245.375,0.85305734
'H11',6,'Source 3','ZIII',5017.375,1
'H12',6,'Source 3','CN',20.375,1
'A1',4,'Source 4','OPA',-183.375,0.017731683
'A2',4,'Source 4','OPA',-102.375,0.044831047
'A3',4,'Source 4','OPA',2752.625,1
'H10',6,'Source 4','ZIIII',2635.75,0.697943562
'H11',6,'Source 4','ZIIII',3878.75,1
'H12',6,'Source 4','CN',-10.25,1
'A1',4,'Source 5','OPA',-236.375,0
'A2',4,'Source 5','OPA',-199.375,0.028094153
'A3',4,'Source 5','OPA',1080.625,1
'H10',6,'Source 5','ZV',3489,0.952202946
'H11',6,'Source 5','ZV',3676,1
'H12',6,'Source 5','CN',31,1
'A1',4,'Source 6','OPA',-221.375,0.008870491
'A2',4,'Source 6','OPA',-150.375,0.050857481
'A3',4,'Source 6','OPA',1454.625,1
'H10',6,'Source 6','ZVI',2224.375,1
'H11',6,'Source 6','ZVI',1418.375,0.672457584
'H12',6,'Source 6','CN',716.375,1

我希望能够添加一个由 Spotfire 方程定义的计算列：

([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) / Max([BkgrdSub Fluorescence] - Min([BkgrdSub Fluorescence])) OVER ([Treat],[Source],[Assay])

我一步一步地构建脚本，我已经到了尝试使用 groupby() 运行的地步：

import pandas as pd

df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1[['BkgrdSub Fluorescence'] - ['Scaled BckgrdSub Fluorescence min']].groupby(df1['Treat'],df1['Source'],df1['Assay']).transform('max')

但是我得到错误：

TypeError: unsupported operand type(s) for -: 'list' and 'list'

我的理解是我不能从列表中减去列表。很明显，语法不支持 groupby() 函数中的方程式。

我还试图通过避免使用 groupby() 来避免此语法错误，其中 'Scaled BckgrdSub Fluorescence' 是所需的结果列：

df1.insert(5,"Scaled BckgrdSub Fluorescence min","")
df1['Scaled BckgrdSub Fluorescence min'] = df1.groupby(['Treat','Source','Assay'])['BkgrdSub Fluorescence'].transform('min')
df1.insert(6,"Scaled BckgrdSub Fluorescence eq","")
df1['Scaled BckgrdSub Fluorescence eq'] = df1['BkgrdSub Fluorescence'] - df1['Scaled BckgrdSub Fluorescence min']
df1.insert(7,"Scaled BckgrdSub Fluorescence max","")
df1['Scaled BckgrdSub Fluorescence max'] = df1.groupby(['Treat','Source','Assay'])['Scaled BckgrdSub Fluorescence eq'].transform('max')
df1.insert(8,"Scaled BckgrdSub Fluorescence","")
df1['Scaled BckgrdSub Fluorescence'] = df1['Scaled BckgrdSub Fluorescence eq'] / df1['Scaled BckgrdSub Fluorescence max']

但是，这与您在 Spotfire 中获得的计算列的结果不同。

计算列的预期输出（取自 Spotfire）已显示在 'Calced' 列中。

所以我的问题是，有没有一种简单的方法可以在几行内使用 groupby() 函数添加我想要的列，同时保持准确？

Answer 1

这是复制目标号码的一种方法：

df1.groupby('Source').apply(lambda x: ((x['BkgrdSub Fluorescence'] - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() - df1['BkgrdSub Fluorescence'].min())))

结果：

Source1  0     0.035583
         1     0.130472
         2     1.000000
Source2  3     0.036826
         4     0.083432
         5     1.000000
Source3  6     0.037329
         7     0.037329
         8     1.000000
Source4  9     0.017732
         10    0.044831
         11    1.000000
Source5  12    0.000000
         13    0.028094
         14    1.000000
Source6  15    0.008870
         16    0.050857
         17    1.000000

更新了新数据集：

df1.index.name = 'ID' # will be used for sorting
# group and apply equation
grouped = df1.groupby(['Treat','Source','Assay']).apply(lambda x: ((x['BkgrdSub Fluorescence'] 
      - df1['BkgrdSub Fluorescence'].min()) / (x['BkgrdSub Fluorescence'].max() 
      - df1['BkgrdSub Fluorescence'].min())))
# remove multi-index from grouped, sort on ID, and insert into df1
df1['Calculated'] = grouped.sort_index(level='ID').reset_index(drop=True)

新结果（参见计算列）：

   Well ID  Assay    Source  Treat  BkgrdSub Fluorescence    Calced   Calculated
ID                                                                              
0       A1      4  Source 1    OPA               -215.750  0.035583     0.035583
1       A2      4  Source 1    OPA               -160.750  0.130472     0.130472
2       A3      4  Source 1    OPA                343.250  1.000000     1.000000
3      H10      6  Source 1    OPP               9896.000  1.000000     1.000000
4      H11      6  Source 1    OPP               9892.000  0.999605     0.999605
5      H12      6  Source 1     CN                 -1.000  1.000000     1.000000
6       A1      4  Source 2    OPA               -170.000  0.036826     0.036826
7       A2      4  Source 2    OPA                -86.000  0.083432     0.083432
8       A3      4  Source 2    OPA               1566.000  1.000000     1.000000
9      H10      6  Source 2     ZI               4885.000  0.809272     0.809272
10     H11      6  Source 2     ZI               6092.000  1.000000     1.000000
11     H12      6  Source 2     CN                 78.000  1.000000     1.000000
12      A1      4  Source 3    OPA               -114.500  0.037329     0.037329
13      A2      4  Source 3    OPA               -114.500  0.037329     0.037329
14      A3      4  Source 3    OPA               3028.500  1.000000     1.000000
15     H10      6  Source 3   ZIII               4245.375  0.853057     0.853057
16     H11      6  Source 3   ZIII               5017.375  1.000000     1.000000
17     H12      6  Source 3     CN                 20.375  1.000000     1.000000
18      A1      4  Source 4    OPA               -183.375  0.017732     0.017732
19      A2      4  Source 4    OPA               -102.375  0.044831     0.044831
20      A3      4  Source 4    OPA               2752.625  1.000000     1.000000
21     H10      6  Source 4  ZIIII               2635.750  0.697944     0.697944
22     H11      6  Source 4  ZIIII               3878.750  1.000000     1.000000
23     H12      6  Source 4     CN                -10.250  1.000000     1.000000
24      A1      4  Source 5    OPA               -236.375  0.000000     0.000000
25      A2      4  Source 5    OPA               -199.375  0.028094     0.028094
26      A3      4  Source 5    OPA               1080.625  1.000000     1.000000
27     H10      6  Source 5     ZV               3489.000  0.952203     0.952203
28     H11      6  Source 5     ZV               3676.000  1.000000     1.000000
29     H12      6  Source 5     CN                 31.000  1.000000     1.000000
30      A1      4  Source 6    OPA               -221.375  0.008870     0.008870
31      A2      4  Source 6    OPA               -150.375  0.050857     0.050857
32      A3      4  Source 6    OPA               1454.625  1.000000     1.000000
33     H10      6  Source 6    ZVI               2224.375  1.000000     1.000000
34     H11      6  Source 6    ZVI               1418.375  0.672458     0.672458
35     H12      6  Source 6     CN                716.375  1.000000     1.000000

使用 pandas groupby() 添加方程的计算列

Adding a calculated column of an equation using pandas groupby()

python

dataframe

pandas

spotfire

pandas-groupby