Get both the top-n values and the names of the columns they occur in, within each row of a DataFrame
I have a dataframe like this:
df = pd.DataFrame({'a':[1,2,1],'b':[4,6,0],'c':[0,4,8]})
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 4 | 0 |
+---+---+---+
| 2 | 6 | 4 |
+---+---+---+
| 1 | 0 | 8 |
+---+---+---+
For each row, I need both the 'n' highest values (two, in this example) and the corresponding column names, in descending order:
row 1: 'b':4,'a':1
row 2: 'b':6,'c':4
row 3: 'c':8,'a':1
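Here are two approaches, both adapted from elsewhere:
1) Use Python Decorate-Sort-Undecorate with a .apply(lambda ...) on each row:
insert the column names, sort in descending order, keep the top-n, reformat the answer. (I think this is cleaner.)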
import numpy as np
import pandas as pd
# Apply Decorate-Sort row-wise to our df, and slice the top-n columns within each row...
sort_decr2_topn = lambda row, nlargest=2: (
    sorted(zip(df.columns, row), key=lambda cv: -cv[1])[:nlargest])
tmp = df.apply(sort_decr2_topn, axis=1)
0 [(b, 4), (a, 1)]
1 [(b, 6), (c, 4)]
2 [(c, 8), (a, 1)]
# then your result (as a NumPy array) is...
np.array(tmp)
array([[('b', 4), ('a', 1)],
[('b', 6), ('c', 4)],
[('c', 8), ('a', 1)]], dtype=object)
# ... or as a list of rows is
tmp.values.tolist()
# ... and you can insert the row-indices 0,1,2 with
list(zip(tmp.index, tmp.values.tolist()))
[(0, [('b', 4), ('a', 1)]), (1, [('b', 6), ('c', 4)]), (2, [('c', 8), ('a', 1)])]
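For comparison, the same row-wise idea can be sketched with pandas' own Series.nlargest instead of a hand-written sort. This is only a minimal sketch under the assumption that every column is numeric; the helper name top_n_pairs is made up for illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 6, 0], 'c': [0, 4, 8]})

def top_n_pairs(row, n=2):
    """Hypothetical helper: (column, value) pairs for the n largest values in one row."""
    # Series.nlargest returns the n largest values in descending order,
    # keeping the column labels as the index of the result.
    top = row.nlargest(n)
    # Convert NumPy integers to plain Python ints for a cleaner repr.
    return [(col, int(val)) for col, val in top.items()]

# Map each row label to its top-n (column, value) pairs.
result = {idx: top_n_pairs(row) for idx, row in df.iterrows()}
print(result)
```

Like the lambda version, this still loops over rows in Python, so it is a readability choice rather than a performance one.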
2) Get the matrix topnlocs as follows, then use it to reindex into both df.columns and df.values, and combine the output:
import numpy as np
nlargest = 2
topnlocs = np.argsort(-df.values, axis=1)[:, 0:nlargest]
# ... now you can use topnlocs to reindex both into df.columns, and df.values, then reformat/combine them somehow
# however it's painful trying to apply that NumPy array of indices back to df or df.values.
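One possible way to finish approach 2 is sketched below. It assumes NumPy 1.15 or later (for np.take_along_axis) and an all-numeric frame, so that negating the values gives a descending sort. The names values, top_cols, top_vals and combined are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 6, 0], 'c': [0, 4, 8]})
nlargest = 2

values = df.to_numpy()
# Column positions of the nlargest biggest values in each row, largest first.
topnlocs = np.argsort(-values, axis=1)[:, :nlargest]

# Fancy-index the column labels with the 2-D position array:
# the result has the same shape as topnlocs.
top_cols = df.columns.to_numpy()[topnlocs]
# Pick the matching values row by row.
top_vals = np.take_along_axis(values, topnlocs, axis=1)

# Pair labels and values per row; tolist() yields plain Python objects.
combined = [list(zip(cols, vals))
            for cols, vals in zip(top_cols.tolist(), top_vals.tolist())]
print(combined)
```

The sort and the indexing are vectorized here; only the final pairing step runs in Python. When a row contains ties, the default argsort does not guarantee which column comes first.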