Counting differences from the consensus in each row via Pandas
I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d'],'B':['a','b','c','x'],'C':['y','b','c','d']})
df
   A  B  C
0  a  a  y
1  b  b  b
2  c  c  c
3  d  x  d
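I want to find the most common character in each row (the consensus) and, for each column, total the number of rows where it differs from that consensus: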
       A  B  C Consensus
0      a  a  y         a
1      b  b  b         b
2      c  c  c         c
3      d  x  d         d
Total  0  1  1         0
Running through a loop is one approach, but it seems inefficient:
consensus = []
for idx in df.index:
    consensus.append(df.loc[idx].value_counts().index[0])
df['Consensus'] = consensus
(and so on)
Is there a straightforward way to get the consensus and count the differences?
You can use mode to get the consensus value:
>>> df.mode(axis=1)
   0
0  a
1  b
2  c
3  d
Note the caveats in the documentation, though:
Gets the mode(s) of each element along the axis selected. Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected axis (when more than one item share the maximum frequency), which is the reason why a dataframe is returned. If you want to impute missing values with the mode in a dataframe df, you can just do this: df.fillna(df.mode().iloc[0])
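To make the caveat concrete, here is a minimal sketch of what a tie looks like. The frame `tied` is a hypothetical example, not data from the question, and taking the first mode column is only one possible tie-breaking rule (it picks the smallest tied value, since the modes come back sorted).

```python
import pandas as pd

# Hypothetical frame: row 0 has two values that each occur twice (a tie),
# row 1 has a single clear consensus.
tied = pd.DataFrame({'A': ['a', 'b'],
                     'B': ['a', 'b'],
                     'C': ['y', 'b'],
                     'D': ['y', 'b']})

# One column per tied mode; rows with fewer modes are padded with NaN.
modes = tied.mode(axis=1)
print(modes)

# Assumed tie-break: keep the first (smallest) mode of each row.
consensus = modes.iloc[:, 0]
print(consensus)
```

If ties matter for your data, you may prefer to inspect the extra columns of `modes` rather than silently dropping them.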
To count the differences from the consensus per column, you can compare with ne and then sum:
>>> df['consensus'] = df.mode(axis=1)[0]  # first mode column; ties would produce extra columns
>>> df.loc[:, 'A':'C'].ne(df['consensus'], axis=0).sum(axis=0)
A    0
B    1
C    1
dtype: int64
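Putting the pieces together, the table from the question could be rebuilt roughly as follows. This is a sketch under a few assumptions: the names `value_cols`, `mismatch`, `totals`, and `result` are illustrative, the first mode is used as the tie-break, and the Total for the Consensus column is filled with 0 to match the desired output.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'],
                   'B': ['a', 'b', 'c', 'x'],
                   'C': ['y', 'b', 'c', 'd']})

value_cols = ['A', 'B', 'C']

# Consensus per row: first mode column (assumed tie-break: smallest tied value).
consensus = df[value_cols].mode(axis=1).iloc[:, 0]

# True wherever a cell disagrees with the consensus of its row.
mismatch = df[value_cols].ne(consensus, axis=0)

# Number of disagreements in each column.
totals = mismatch.sum(axis=0)

# Attach the consensus column, then append the totals as a row labelled 'Total'.
result = df[value_cols].assign(Consensus=consensus)
total_row = totals.reindex(result.columns, fill_value=0).rename('Total')
result = pd.concat([result, total_row.to_frame().T])
print(result)
```

Summing `mismatch` with `axis=1` instead would give the number of disagreeing cells per row rather than per column.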