Counting differences from the consensus in each row via Pandas

Question

I have a DataFrame like this:

import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d'],'B':['a','b','c','x'],'C':['y','b','c','d']})
df

   A  B  C
0  a  a  y
1  b  b  b
2  c  c  c
3  d  x  d

I would like to find the most common character in each row and total the differences from that consensus for each column:

       A  B  C Consensus
    0  a  a  y         a
    1  b  b  b         b
    2  c  c  c         c
    3  d  x  d         d
Total  0  1  1         0

Looping over the rows is one approach, but it seems inefficient:

consensus = []
for idx in df.index:
    consensus.append(df.loc[idx].value_counts().index[0])
df['Consensus'] = consensus

(and so on)

Is there a straightforward way to get the consensus and count the differences?

Answer 1

You can use mode to get the consensus value:

>>> df.mode(axis=1)
   0
0  a
1  b
2  c
3  d

But note the caveats in the documentation:

Gets the mode(s) of each element along the axis selected. Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with nan.

Note that there could be multiple values returned for the selected axis (when more than one item share the maximum frequency), which is the reason why a dataframe is returned. If you want to impute missing values with the mode in a dataframe df, you can just do this: df.fillna(df.mode().iloc[0])
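To make the tie caveat concrete, here is a minimal sketch. The frame `tied` and its values are made up for illustration and are not from the question. When a row has two equally frequent values, mode returns one column per tied value, so one reasonable choice (an assumption, not the only option) is to keep the first column as the consensus:

```python
import pandas as pd

# Hypothetical data: row 0 has a two-way tie ('a' and 'y'), row 1 has a clear mode ('b').
tied = pd.DataFrame({'A': ['a', 'b'],
                     'B': ['a', 'b'],
                     'C': ['y', 'b'],
                     'D': ['y', 'c']})

# One column per tied mode; rows with fewer modes are padded with NaN.
modes = tied.mode(axis=1)

# Keep only the first mode of each row as the consensus.
consensus = modes.iloc[:, 0]
print(modes)
print(consensus)
```

Picking the first column resolves ties by sort order, which is arbitrary; if ties matter for your data, you may prefer to flag those rows instead.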

To count the differences from the consensus in each column, compare with ne and then sum:

>>> df['consensus'] = df.mode(axis=1)
>>> df.loc[:, 'A':'C'].ne(df['consensus'], axis=0).sum(axis=0)
A    0
B    1
C    1
dtype: int64
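Putting the pieces together, the table from the question could be rebuilt roughly as follows. This is a sketch under a few assumptions: ties are resolved by taking the first mode, and the names `consensus`, `diffs`, and `result` are my own choices rather than anything pandas requires.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'],
                   'B': ['a', 'b', 'c', 'x'],
                   'C': ['y', 'b', 'c', 'd']})

# Row-wise mode; keep the first column in case a row has tied modes.
consensus = df.mode(axis=1).iloc[:, 0]

# Add the consensus as a column without modifying the original frame.
with_consensus = df.assign(Consensus=consensus)

# True where a cell differs from its row's consensus, summed per column.
# The Consensus column trivially differs from itself 0 times.
diffs = with_consensus.ne(consensus, axis=0).sum()

# Append the per-column totals as a final row labelled 'Total'.
result = pd.concat([with_consensus, diffs.to_frame('Total').T])
print(result)
```

Building the totals from a copy that already includes the Consensus column keeps the Total row aligned with every column, which matches the layout asked for in the question.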
