Sorting/Cluster a 2D numpy array in ordered sequence based on multiple columns

Question

I have a 2D numpy array like this:

 [[4 5 2] 
  [5 5 1]
  [5 4 5]
  [5 3 4]
  [5 4 4]
  [4 3 2]]

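I want to sort/cluster this array by finding sequences within each row such as row[0]>=row[1]>=row[2], row[0]>=row[2]>row[1], and so on, so that the rows of the array are ordered by these patterns.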

I tried the code: lexdf = df[np.lexsort((df[:,2], df[:,1],df[:,0]))][::-1], but this is not what I want. Output of the lexsort:

 [[5 5 1]
  [5 4 5]
  [5 4 4]
  [5 3 4]
  [4 5 2] 
  [4 3 2]]

The output I want:

 [[5 5 1]
  [5 4 4]
  [4 3 2]
  [5 4 5]
  [5 3 4]
  [4 5 2]]

Or, split into three parts:

 [[5 5 1]
 [5 4 4]
 [4 3 2]]

 [[5 4 5]
 [5 3 4]]

 [[4 5 2]]

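I also want to apply this to arrays with more columns, so it would be better to avoid iteration. Any ideas for generating this kind of output?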

Answer 1

I don't know how to do this in numpy, except perhaps with some odd use of numpy.split.
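For comparison, one possible vectorized route in numpy is sketched below. It is not the approach developed in this answer and rests on an assumption: that the "sequence" of a row means the order of its columns from largest to smallest value, with ties going to the leftmost column (so row[0]>=row[1]>=row[2] becomes the pattern 0, 1, 2). The names pattern, order, starts and groups are illustrative only.

```python
import numpy as np

a = np.array([[4, 5, 2],
              [5, 5, 1],
              [5, 4, 5],
              [5, 3, 4],
              [5, 4, 4],
              [4, 3, 2]])

# Assumption: a row's pattern is its column order from largest to smallest
# value; a stable sort breaks ties in favour of the leftmost column.
# Negating the array assumes a signed integer or float dtype.
pattern = np.argsort(-a, axis=1, kind="stable")

# np.lexsort uses the LAST key as the primary one, so the pattern columns are
# passed in reverse order. The sort is stable, so rows that share a pattern
# keep their original relative order.
order = np.lexsort(pattern.T[::-1])
sorted_rows = a[order]
sorted_pattern = pattern[order]

# A new group starts wherever the pattern differs from the previous row.
changed = np.any(sorted_pattern[1:] != sorted_pattern[:-1], axis=1)
starts = np.flatnonzero(changed) + 1
groups = np.split(sorted_rows, starts)

print(sorted_rows)
for g in groups:
    print(g)
```

On the example array, sorted_rows comes out in the order requested in the question, and groups holds the three parts with 3, 2 and 1 rows. Whether this definition of a pattern is the right one for wider arrays depends on how ties and longer orderings should be treated.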

Here is a way to get the groups using python lists:

from itertools import groupby, pairwise

def f(sublist):
    # Grouping key: for each adjacent pair (x, y), whether x <= y
    return [x <= y for x, y in pairwise(sublist)]

# NOTE: itertools.pairwise requires python>=3.10
# For python<=3.9, use one of these alternatives:
# * more_itertools.pairwise(sublist)
# * zip(sublist, sublist[1:])

a = [[4, 5, 2],
     [5, 5, 1],
     [5, 4, 5],
     [5, 3, 4],
     [5, 4, 4],
     [4, 3, 2]]

b = [list(g) for _,g in groupby(sorted(a, key=f), key=f)]

print(b)
# [[[4, 3, 2]],
#  [[5, 4, 5], [5, 3, 4], [5, 4, 4]],
#  [[4, 5, 2], [5, 5, 1]]]

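Note: the groupby+sorted combination is actually slightly inefficient, because sorted takes n log(n) time. A linear-time alternative is to group using a dictionary of lists. See for example [=15=].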
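A minimal sketch of that dictionary-of-lists grouping follows. The helper names pairwise_key and group_rows are hypothetical; the key is the same x <= y pattern as f, converted to a tuple so that it can be used as a dictionary key, and zip(row, row[1:]) is used instead of pairwise so that it also runs on python<=3.9.

```python
from collections import defaultdict

def pairwise_key(row):
    # Same comparison pattern as f, but as a hashable tuple
    return tuple(x <= y for x, y in zip(row, row[1:]))

def group_rows(rows):
    # One pass over the rows: append each row to the list for its key
    groups = defaultdict(list)
    for row in rows:
        groups[pairwise_key(row)].append(row)
    return groups

a = [[4, 5, 2],
     [5, 5, 1],
     [5, 4, 5],
     [5, 3, 4],
     [5, 4, 4],
     [4, 3, 2]]

groups = group_rows(a)
for key, rows in groups.items():
    print(key, rows)
```

The groups appear in the order in which each key is first seen rather than in sorted key order, so sort the keys afterwards if the order of the groups matters.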


python

arrays

sorting

numpy

sequence