Getting n-th ranked column IDs per row of a dataframe - Python/Pandas
I am trying to find a way to look up the n-th ranked value in each row and return its column name. For example, given a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
# Return column name of "MAX" value, compared to other columns in any particular row.
df['MAX1_NAMES'] = df.idxmax(axis=1)
print(df)
          A         B         C         D MAX1_NAMES
0 -0.728424 -0.764682 -1.506795  0.722246          D
1  1.305500 -1.191558  0.068829 -1.244659          A
2 -0.175834 -0.140273  1.117114  0.817358          C
3 -0.255825 -1.534035 -0.591206 -0.352594          A
4 -2.408806 -1.925055 -1.797020  2.381936          D
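This finds the maximum value in each row and returns the name of the column where it occurs. But I need to be able to choose a specific rank for the value I want, and would like to end up with a dataframe like this: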
          A         B         C         D MAX1_NAMES MAX2_NAMES
0 -0.728424 -0.764682 -1.506795  0.722246          D          A
1  1.305500 -1.191558  0.068829 -1.244659          A          C
2 -0.175834 -0.140273  1.117114  0.817358          C          D
3 -0.255825 -1.534035 -0.591206 -0.352594          A          D
4 -2.408806 -1.925055 -1.797020  2.381936          D          C
where MAX2_NAMES is the column holding the second-largest value in that row.
Thanks.
You can apply argsort() to each row, reverse the indices, and pick the one at the second position:
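# np.argsort on r.values returns plain positions, avoiding label-vs-position ambiguity when indexing a Series
df['MAX2_NAMES'] = df.iloc[:, :4].apply(lambda r: r.index[np.argsort(r.values)[::-1][1]], axis=1)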
df
#          A         B         C         D MAX1_NAMES MAX2_NAMES
#0 -0.728424 -0.764682 -1.506795  0.722246          D          A
#1  1.305500 -1.191558  0.068829 -1.244659          A          C
#2 -0.175834 -0.140273  1.117114  0.817358          C          D
#3 -0.255825 -1.534035 -0.591206 -0.352594          A          D
#4 -2.408806 -1.925055 -1.797020  2.381936          D          C
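The same argsort idea can be applied to the whole array at once instead of row by row. The following is only a sketch: the helper name nth_largest_col and its parameters are hypothetical, and the sample values are the first three rows of the question's dataframe.

```python
import numpy as np
import pandas as pd

def nth_largest_col(df, n, cols):
    """Return, per row, the name of the column holding the n-th largest value (n is 1-based)."""
    values = df[cols].to_numpy()
    # argsort on the negated values orders each row from largest to smallest
    order = np.argsort(-values, axis=1)
    # pick the (n-1)-th position of every row and map it back to a column name
    return pd.Series(np.asarray(cols)[order[:, n - 1]], index=df.index)

cols = list('ABCD')
df = pd.DataFrame(
    [[-0.728424, -0.764682, -1.506795, 0.722246],
     [1.305500, -1.191558, 0.068829, -1.244659],
     [-0.175834, -0.140273, 1.117114, 0.817358]],
    columns=cols,
)
df['MAX1_NAMES'] = nth_largest_col(df, 1, cols)
df['MAX2_NAMES'] = nth_largest_col(df, 2, cols)
print(df)
```

Passing the value columns explicitly keeps the helper from ranking the name columns it has already added.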
Since you only need the entry at one specific rank n, I would suggest np.argpartition, which only has to place the n-th ranked entry of each row at its sorted position rather than sort all elements. This is aimed at improved performance. The performance benefits are discussed at length in the answers to "A fast way to find the largest N elements in an numpy array". Hopefully we can benefit from that here too.
So, as a function, we would have -
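def rank_df(df, rank):
    cols = ['A', 'B', 'C', 'D']
    coln = 'MAX' + str(rank) + '_NAMES'
    # kth=rank-1 puts the rank-th largest value at position rank-1 of each row;
    # the entries before it are larger, but in no guaranteed order.
    sortID = np.argpartition(-df[cols].values, rank - 1, axis=1)[:, rank - 1]
    df[coln] = np.array(cols)[sortID]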
Sample run -
In [84]: df
Out[84]:
          A         B         C         D
0 -0.124851  0.152432  1.436602 -0.391178
1  0.371932  1.732399  0.340876 -1.340609
2 -1.218608  0.444246  0.169968 -1.437259
3 -0.828132  0.821613 -0.556643 -0.407703
4 -0.390477  0.048824 -2.087323  1.597030
In [85]: rank_df(df,1)
In [86]: rank_df(df,2)
In [87]: df
Out[87]:
          A         B         C         D MAX1_NAMES MAX2_NAMES
0 -0.124851  0.152432  1.436602 -0.391178          C          B
1  0.371932  1.732399  0.340876 -1.340609          B          A
2 -1.218608  0.444246  0.169968 -1.437259          B          C
3 -0.828132  0.821613 -0.556643 -0.407703          B          D
4 -0.390477  0.048824 -2.087323  1.597030          D          B
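To see what np.argpartition guarantees, it may help to compare it against a full sort. The sketch below uses a hypothetical helper nth_largest_idx and randomly generated data (both are assumptions for illustration). It partitions each row with kth = n - 1, which is the position a full descending sort would give the n-th largest entry, and checks the result against np.argsort for every rank.

```python
import numpy as np

def nth_largest_idx(values, n):
    """Column position of the n-th largest entry in each row (n is 1-based)."""
    # kth = n - 1 guarantees that position n - 1 of each partitioned row
    # holds the element a full descending sort would put there.
    return np.argpartition(-values, n - 1, axis=1)[:, n - 1]

rng = np.random.default_rng(0)
values = rng.standard_normal((1000, 6))

# Compare against a full sort for every possible rank.
for n in range(1, values.shape[1] + 1):
    expected = np.argsort(-values, axis=1)[:, n - 1]
    assert np.array_equal(nth_largest_idx(values, n), expected)
```

Only the entry at position kth is guaranteed to be in its sorted place. The entries before it are the larger ones, but their order is unspecified, so the index taken from the result has to match kth.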
Runtime test
I am timing the np.argpartition based method listed earlier in this post against the np.argsort based method listed in the other solution by @Psidom, on a decently sized dataframe.
In [92]: df = pd.DataFrame(np.random.randn(10000, 4), columns = list('ABCD'))
In [93]: %timeit rank_df(df,2)
100 loops, best of 3: 2.36 ms per loop
In [94]: df = pd.DataFrame(np.random.randn(10000, 4), columns = list('ABCD'))
In [95]: %timeit df['MAX2_NAMES'] = df.iloc[:,:4].apply(lambda r: r.index[r.argsort()[::-1][1]], axis = 1)
1 loops, best of 3: 3.32 s per loop
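Outside IPython, a comparison along these lines could be reproduced with the standard timeit module. This is a rough sketch, not the original benchmark: the function names by_argpartition and by_apply_argsort, the row count, and the repeat count are all assumptions, and the timings will vary by machine and library version.

```python
import timeit
import numpy as np
import pandas as pd

cols = list('ABCD')
df = pd.DataFrame(np.random.randn(2000, 4), columns=cols)

def by_argpartition(df, n):
    # Vectorized: partition every row once, then map positions to column names.
    values = df[cols].to_numpy()
    idx = np.argpartition(-values, n - 1, axis=1)[:, n - 1]
    return np.asarray(cols)[idx]

def by_apply_argsort(df, n):
    # Row-wise: a Python-level lambda runs once per row.
    return df[cols].apply(
        lambda r: r.index[np.argsort(r.to_numpy())[::-1][n - 1]], axis=1
    ).to_numpy()

# Both approaches should agree before their timings are compared.
assert list(by_argpartition(df, 2)) == list(by_apply_argsort(df, 2))

repeats = 5
t_part = timeit.timeit(lambda: by_argpartition(df, 2), number=repeats) / repeats
t_apply = timeit.timeit(lambda: by_apply_argsort(df, 2), number=repeats) / repeats
print(f"argpartition:  {t_part:.6f} s per call")
print(f"apply+argsort: {t_apply:.6f} s per call")
```

Checking that both functions return the same labels first makes sure the timings compare equivalent work.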
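You can do this by combining rank, apply and idxmin. Note that rank orders values ascending by default, so rank 2 below is the second-smallest value in each row; pass ascending=False to rank from the largest value instead.
For example: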
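# pd.util.testing.makeTimeDataFrame was removed from recent pandas releases; this builds a similar frame
df = pd.DataFrame(np.random.randn(5, 4), index=pd.bdate_range('2000-01-03', periods=5), columns=list('ABCD'))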
df
                   A         B         C         D
2000-01-03 -1.814888 -0.709120 -0.134390 -0.906183
2000-01-04  0.459742  1.235481  0.109602 -0.226923
2000-01-05 -1.567867  0.562368 -1.185567 -2.176161
2000-01-06  0.747989 -0.160384  1.617100  0.242830
2000-01-07 -1.288061 -1.631342 -0.857830 -0.210695
df['rank_2_col'] = df.rank(axis=1).apply(lambda r: r[r == 2].idxmin(), axis=1)
df
                   A         B         C         D rank_2_col
2000-01-03 -1.814888 -0.709120 -0.134390 -0.906183          D
2000-01-04  0.459742  1.235481  0.109602 -0.226923          C
2000-01-05 -1.567867  0.562368 -1.185567 -2.176161          A
2000-01-06  0.747989 -0.160384  1.617100  0.242830          D
2000-01-07 -1.288061 -1.631342 -0.857830 -0.210695          A
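For the second-largest value, as asked in the question, the rank-based idea can be adapted by ranking in descending order. The following is a sketch under a few assumptions: it reuses the first three rows of the sample values above with a default integer index, and it passes method='first' so that tied values still receive distinct ranks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[-1.814888, -0.709120, -0.134390, -0.906183],
     [0.459742, 1.235481, 0.109602, -0.226923],
     [-1.567867, 0.562368, -1.185567, -2.176161]],
    columns=list('ABCD'),
)

# ascending=False gives rank 1 to the largest value in each row;
# method='first' breaks ties by column order so every rank occurs exactly once.
ranks = df.rank(axis=1, ascending=False, method='first')

# (ranks == 2) is True in exactly one column per row, and idxmax returns that column.
df['rank_2_col'] = (ranks == 2).idxmax(axis=1)
print(df)
```

Comparing the whole rank frame to 2 and calling idxmax avoids the per-row apply, and the ranks are computed before the new column is added, so only the numeric columns are ranked.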