Pandas compare each row to reference row - certain columns only

Question

I have the following Pandas DataFrame in Python:****

   Temp_Fact Oscillops_read         A         B         C         D         E         F         G         H         I         J
0          A          Today  0.710213  0.222015  0.814710  0.597732  0.634099  0.338913  0.452534  0.698082  0.706486  0.433162
1          B          Today  0.653489  0.452543  0.618755  0.555629  0.490342  0.280299  0.026055  0.138876  0.053148  0.899734
2          A          Aactl  0.129211  0.579690  0.641324  0.615772  0.927384  0.199651  0.652395  0.249467  0.262301  0.049795
3          A            DFE  0.743794  0.355085  0.637794  0.633634  0.810033  0.509244  0.470418  0.972145  0.647222  0.610636
4          C    Real_Mt_Olv  0.724282  0.332965  0.063078  0.004550  0.585398  0.869376  0.232148  0.630162  0.102206  0.232981
5          E         Q_Mont  0.221685  0.224834  0.110734  0.397999  0.814153  0.552924  0.981098  0.536750  0.251941  0.383994
6          D            DFE  0.655386  0.561297  0.305310  0.140998  0.433054  0.118187  0.479206  0.556546  0.556017  0.025070
7          F           Bryo  0.257884  0.228650  0.413149  0.285651  0.814095  0.275627  0.775620  0.392448  0.827725  0.935581
8          C          Aactl  0.017388  0.133848  0.939049  0.159416  0.923788  0.375638  0.331078  0.939089  0.098718  0.785569
9          C          Today  0.197419  0.595253  0.574718  0.373899  0.363200  0.289378  0.698455  0.252657  0.357485  0.020484
10         C           Pars  0.037771  0.683799  0.184114  0.545062  0.857000  0.295918  0.733196  0.613165  0.180642  0.254839
11         B           Pars  0.637346  0.090000  0.848710  0.596883  0.027026  0.792180  0.843743  0.461608  0.552165  0.215250
12         B           Pars  0.768422  0.017828  0.090141  0.108061  0.456734  0.803175  0.454479  0.501713  0.687016  0.625260
13         E       Tomorrow  0.860112  0.532859  0.091641  0.768896  0.635966  0.007211  0.656367  0.053136  0.482367  0.680557
14         D            DFE  0.801734  0.365921  0.243407  0.826373  0.904416  0.062448  0.801726  0.049983  0.433135  0.351150
15         F         Q_Mont  0.360710  0.330745  0.598830  0.582379  0.828019  0.467044  0.287276  0.470980  0.355386  0.404299
16         D      Last_Week  0.867126  0.600093  0.813257  0.005423  0.617543  0.657219  0.635255  0.314910  0.016516  0.689257
17         E      Last_Week  0.551499  0.724981  0.821087  0.175279  0.301397  0.304105  0.379553  0.971244  0.558719  0.154240
18         F           Bryo  0.511370  0.208831  0.260223  0.089106  0.121442  0.120513  0.099722  0.750769  0.860541  0.838855
19         E           Bryo  0.323441  0.663328  0.951847  0.782042  0.909736  0.512978  0.999549  0.225423  0.789240  0.155898
20         C       Tomorrow  0.267086  0.357918  0.562190  0.700404  0.961047  0.513091  0.779268  0.030190  0.460805  0.315814
21         B       Tomorrow  0.951356  0.570077  0.867533  0.365708  0.791373  0.232377  0.478656  0.003857  0.805882  0.989754
22         F          Today  0.963750  0.118826  0.264858  0.571066  0.761669  0.967419  0.565773  0.468971  0.466120  0.174815
23         B      Last_Week  0.291186  0.126748  0.154725  0.527029  0.021485  0.224272  0.259218  0.052286  0.205569  0.617701
24         F          Aactl  0.269308  0.655920  0.595518  0.404817  0.290342  0.447246  0.627082  0.306856  0.868357  0.979879

I also have a Series of values, one for each column:

df_base = df[df['Oscillops_read'] == 'Last_Week']
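# numeric_only=True skips the string columns, which recent pandas versions refuse to average
df_base_val = df_base.mean(axis=0, numeric_only=True)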

As you can see, this is a Pandas Series holding the mean of each column over the rows where Oscillops_read == 'Last_Week'. Here is the Series:

[0.56993702256121603, 0.48394061768804786, 0.59635616273775061, 0.23591030688019868, 0.31347492150330231, 0.39519847430740507, 0.42467546792253791, 0.4461465888887961, 0.26026797943899194, 0.48706569569369912]

I also have 2 lists:

1.

range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']

This list gives the values that must be added to the dataframe df under certain conditions (described below).

2.

col_1 = list('DFA')
col_2 = list('ACEF')
col_3 = list('CEF')
col_4 = list('ABDF')
col_5 = list('DEF')
col_6 = list('AC')
col_7 = list('ABCDE')

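These are lists of column names. These columns of df must be compared with the mean Series above. So, for example, for the 6th list, col_6, columns A and C of every row of the dataframe df must be compared with columns A and C of the Series.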

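Problem: As mentioned above, I need to compare specific columns of the dataframe df with the base Series df_base_val. The columns to compare are listed in col_1, col_2, col_3, ..., col_7. Here is what I need to do: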

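If, for a given row, the values in the columns listed in col_6 (that is, columns A and C) are greater than the base Series df_base_val in those 2 columns, then enter the 6th value from the list range_name_list in a new column Range for that row.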

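Example: Take col_6. This is the 6th list, and it holds the column names A and C.

Step 1: For row 1 of df, columns A and C are greater than df_base_val[A] and df_base_val[C], respectively.
Step 2: So, for row 1, enter the 6th element of range_name_list in the new column Range. The 6th element is Barometer_Output.

Example output: After doing this, the first row becomes: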

0          A          Today  0.710213  0.222015  0.814710  0.597732  0.634099  0.338913  0.452534  0.698082  0.706486  0.433162  'Barometer_Output'

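Now, if this row were not greater than the Series in columns A and C, and not greater than the Series in the columns listed in col_1 or any of the other lists either, then the Range column must be assigned the value 'Not_in_Range'. In that case, the row would become: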

0          A          Today  0.710213  0.222015  0.814710  0.597732  0.634099  0.338913  0.452534  0.698082  0.706486  0.433162  'Not_in_Range'

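Simplifications and questions: In this example:

I only compared the first row against the base Series. I need to compare every row of df individually against the base Series and add the appropriate value.
I only used the 6th column list, col_6. Similarly, I need to go through each of the column name lists: col_1, col_2, ..., col_7.
If the row being compared is not greater than the Series in the specified columns for any of the lists col_1 to col_7, then the Range column must be assigned the value 'Not_in_Range'.

Is there a way to do this? Perhaps using loops?

**** To create the dataframe above, select and copy it from above. Then use the following code: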

import pandas as pd
df = pd.read_clipboard()
print(df)

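Edit: If multiple conditions are met, I need to list all of them. That is, if the row belongs to both 'Swg' and 'Curnt', then I need to list both in the Range column, or create separate Range columns, or just a Python list, with one entry per matching result. Range1 would list 'Swg', Range2 would list 'Curnt', and so on.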

Answer 1

For starters, I would create a dictionary holding your sets of conditions, whose keys can be used as indices into your range_name_list list:

conditions = {0: list('DFA'),
              1: list('ACEF'),
              2: list('CEF'),
              3: list('ABDF'),
              4: list('DEF'),
              5: list('AC'),
              6: list('ABCDE')}

The code below will accomplish what I understand your task to be:

# Create your Range column to be filled in later.
df['Range'] = '|'
for index, row in df.iterrows():
  # .items() replaces the Python 2 only .iteritems(); 'columns' avoids
  # shadowing the built-in list.
  for ix, columns in conditions.items():
    # Create a list of the outcomes of checking whether the
    # value for each condition column is greater than the
    # df_base_val average.
    truths = [row[column] > df_base_val[column] for column in columns]
    # See if all checks evaluated to True
    if all(truths):
      # If so, append the appropriate 'range_name' to the 'Range'
      # column's value for the current row (.loc replaces the removed .ix)
      df.loc[index, 'Range'] = df.loc[index, 'Range'] + range_name_list[ix] + "|"
# Fill in all rows where no conditions were met with 'Not_in_Range'
df.loc[df['Range'] == '|', 'Range'] = 'Not_in_Range'
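
The loop above stores every matching name in one pipe-delimited string. If a Python list per row is preferred, as the question's edit mentions, the string can be split afterwards. A minimal sketch, assuming Range values in the format produced above; the sample values and the helper to_list are made up for illustration:

```python
import pandas as pd

# Hypothetical Range values in the format produced by the loop above.
ranges = pd.Series(['|Curnt|Swg|', '|Barometer_Output|', 'Not_in_Range'])


def to_list(value):
    """Turn '|a|b|' into ['a', 'b']; the placeholder becomes an empty list."""
    if value == 'Not_in_Range':
        return []
    # Drop the leading and trailing pipes, then split on the remaining ones.
    return value.strip('|').split('|')


range_lists = ranges.apply(to_list)
print(range_lists)
```

Returning an empty list for 'Not_in_Range' is a design choice made here so that every row holds the same type; keeping the placeholder string would work as well.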

Answer 2

Try this code:

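import pandas as pd
from io import BytesIO

# txt is assumed to hold the question's table as a bytes string.
df = pd.read_csv(BytesIO(txt), sep=r'\s+')
df_base = df[df['Oscillops_read'] == 'Last_Week']
# numeric_only=True skips the string columns when averaging.
df_base_val = df_base.mean(axis=0, numeric_only=True)
columns = ['DFA', 'ACEF', 'CEF', 'ABDF', 'DEF', 'AC', 'ABCDE']
range_name_list = ['Base','Curnt','Prediction','Graph','Swg','Barometer_Output','Test_Cntr']

ranges = pd.Series(["Not_in_Range" for _ in range(df.shape[0])], index=df.index)

for name, cols in zip(range_name_list, columns):
    cols = list(cols)
    # Rows where every listed column exceeds its base value.
    idx = df.index[(df[cols] > df_base_val[cols]).all(axis=1)]
    # A later match overwrites an earlier one for the same row.
    ranges.loc[idx] = name

print(ranges)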

But I don't know what you want to happen if a row matches more than one range.
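
One way to handle multiple matches, as the question's edit asks, might be to keep one boolean column per range name and collect the matching names afterwards. The following is a minimal sketch on a small made-up frame (not the question's data); the helper frame matches is an assumption introduced here for illustration:

```python
import pandas as pd

range_name_list = ['Base', 'Curnt', 'Prediction', 'Graph', 'Swg',
                   'Barometer_Output', 'Test_Cntr']
columns = ['DFA', 'ACEF', 'CEF', 'ABDF', 'DEF', 'AC', 'ABCDE']

# Small made-up frame (not the question's data) so the result is easy to check.
df = pd.DataFrame({
    'Oscillops_read': ['Today', 'Bryo', 'Last_Week', 'Pars'],
    'A': [0.9, 0.9, 0.5, 0.1],
    'B': [0.1, 0.9, 0.5, 0.1],
    'C': [0.9, 0.9, 0.5, 0.9],
    'D': [0.1, 0.9, 0.5, 0.9],
    'E': [0.1, 0.9, 0.5, 0.9],
    'F': [0.1, 0.9, 0.5, 0.9],
})
df_base = df[df['Oscillops_read'] == 'Last_Week']
df_base_val = df_base.mean(axis=0, numeric_only=True)

# One boolean column per range name: True where every listed column
# of the row exceeds the corresponding base value.
matches = pd.DataFrame({
    name: (df[list(cols)] > df_base_val[list(cols)]).all(axis=1)
    for name, cols in zip(range_name_list, columns)
})

# Collect the names of all matching ranges for each row, falling back
# to the placeholder when nothing matched.
df['Range'] = pd.Series(
    [[name for name in range_name_list if row[name]] or ['Not_in_Range']
     for _, row in matches.iterrows()],
    index=df.index,
)
print(df[['Oscillops_read', 'Range']])
```

Keeping the intermediate matches frame also makes it straightforward to derive separate Range1, Range2, ... columns later, if that layout is preferred over a list per row.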

python

comparison

python-2.7

pandas