Pandas

Question

我有一个看起来像这样的 DF:

             Virus         Host  blastRank  crisprRank  mashRank
0      NC_000866|1  NC_017660|1        1.0         inf       inf
1      NC_000871|1  NC_017595|1        1.0         inf       inf
2      NC_000872|1  NC_017595|1        1.0         inf       inf
3      NC_000896|1  NC_008530|1        1.0         inf       inf
4      NC_000902|1  NC_011353|1        1.0         inf       inf
...            ...          ...        ...         ...       ...
51935  NC_024392|1  NC_021824|1        inf         inf       1.0
51936  NC_024392|1  NC_021829|1        inf         inf       1.0
51937  NC_024392|1  NC_021837|1        inf         inf       1.0
51938  NC_024392|1  NC_021872|1        inf         inf       1.0
51939  NC_024392|1  NC_022737|1        inf         inf       1.0

我想要的是按 Virus 对这个 df 进行分组，并为每个组取每列中等于 min 的行（第一行是列 blastRank 为 min 的行，第二行是 crisprRank 列最小的行等）。如果有多个最小值，那么我想保留所有列。我还必须以某种方式做到这一点，它不仅支持这 3 列（我的程序必须支持 3 个以上的数字列，这就是我使用 df[df.columns.to_list()[2:]][ 的原因=16=]

这是我的代码和它生成的 df：

df = df.groupby(['Virus'], as_index=False).apply(lambda x: x.loc[x[x.columns.to_list()[2:]].idxmin()].reset_index(drop=True))


             Virus         Host  blastRank  crisprRank  mashRank
0   0  NC_000866|1  NC_017660|1        1.0         inf       inf
    1  NC_000866|1  NC_017660|1        1.0         inf       inf
    2  NC_000866|1  NC_002163|1        inf         inf       1.0
1   0  NC_000871|1  NC_017595|1        1.0         inf       inf
    1  NC_000871|1  NC_006449|1        inf         1.0       1.0
...            ...          ...        ...         ...       ...
818 1  NC_024391|1  NC_009641|1        1.0         inf       inf
    2  NC_024391|1  NC_003103|1        inf         inf       1.0
819 0  NC_024392|1  NC_021823|1        1.0         1.0       inf
    1  NC_024392|1  NC_021823|1        1.0         1.0       inf
    2  NC_024392|1  NC_003212|1        inf         inf       1.0

如您所见，idxmin() returns 只是第一个最小值。我想做类似 idxmin(keep='all') 的事情来获得所有关系。

Answer 1

我认为您需要始终测试每组的最小值：

cols = df.columns.to_list()[2:]

f = lambda x: x.apply(lambda x: x[x == x.min()].reset_index(drop=True))
df = df.groupby(['Virus'])[cols].apply(f)

如果需要所有值的原始顺序：

cols = df.columns.to_list()[2:]

f = lambda x: x[cols].where(x[cols].eq(x[cols].min()))
df[cols] = df.groupby(['Virus'], as_index=False).apply(f)
df = df.dropna(subset=cols, how='all')

或者：

df = df.melt(['Virus','Host'])
df1 = df[df.groupby(['Virus','variable'])['value'].transform('min').eq(df['value'])].copy()
df1 = df1.pivot(['Virus','Host'],'variable','value')

print (df1)

Answer 2

这是解决问题的一种方法：

import numpy as np
import pandas as pd
from io import StringIO

data = StringIO("""
             Virus         Host  blastRank  crisprRank  mashRank
0      NC_000866|1  NC_017660|1        1.0         5       8
1      NC_000866|1  NC_017595|1        2.0         4       5
2      NC_000872|1  NC_017595|1        3.0         3       10
3      NC_000872|1  NC_008530|1        4.0         0       3
4      NC_000872|1  NC_011353|1        5.0         1       -3
""")
df = pd.read_csv(data, sep='\s+').convert_dtypes()

cols_of_interest = [c for c in df.columns if c not in ['Virus', 'Host']]

def get_all_min(sdf):    
    sdf_min = sdf.min().to_frame().T
    result = pd.concat([pd.merge(sdf, sdf_min[[c]], how='inner') for c in sdf_min.columns if c in cols_of_interest])
    result = result.drop_duplicates().reset_index(drop=True)
    return result

df.groupby('Virus', as_index=False).apply(get_all_min).reset_index(drop=True)

Pandas - 多列上的 idxmin 保持所有关系

Pandas - idxmin on multiple columns with keeping all ties

python

bioinformatics

python-3.x