Python

Question

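I'm writing a function that extracts the top x values from a sparse vector (or fewer, if the vector holds fewer than x values). Like many functions, I'd like it to have an "inplace" option: if the option is true, the top values are removed from the vector; if it is false, they are left in place.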

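My problem is that my current function overwrites the input vector instead of leaving it as it was, and I'm not sure why. I expected to solve this by adding an if statement that copies the input with copy.copy(), but that raises a ValueError (ValueError: row index exceeds matrix dimensions), which makes no sense to me.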

Code:

from scipy.sparse import csr_matrix
import copy

max_loc=20
data=[1,3,3,2,5]
rows=[0]*len(data)
indices=[4,2,8,12,7]

sparse_test=csr_matrix((data, (rows,indices)), shape=(1,max_loc))

print(sparse_test)

def top_x_in_sparse(in_vect,top_x,inplace=False):
    if inplace==True:
        rvect=in_vect
    else:
        rvect=copy.copy(in_vect)
    newmax=top_x
    count=0
    out_list=[]
    while newmax>0:
        newmax=1
        if count<top_x:
            out_list+=[csr_matrix.max(rvect)]
            remove=csr_matrix.argmax(rvect)
            rvect[0,remove]=0
            rvect.eliminate_zeros()
            newmax=csr_matrix.max(rvect)
            count+=1
        else:
            newmax=0
    return out_list

a=top_x_in_sparse(sparse_test,3)

print(a)

print(sparse_test)

My question has two parts:

How do I prevent this function from overwriting the vector?
How do I add an inplace option?

Answer 1

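You really just don't want to loop here. Reallocating the arrays with .eliminate_zeros() on every loop iteration is the slowest part, but it isn't the only reason to avoid looping.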

import numpy as np
from scipy.sparse import csr_matrix

max_loc=20
data=[1,3,3,2,5]
rows=[0]*len(data)
indices=[4,2,8,12,7]

sparse_test=csr_matrix((data, (rows,indices)), shape=(1,max_loc))

Something like this would be better:

def top_x_in_sparse(in_vect,top_x,inplace=False):

    # Number of explicitly stored values in the (1 x N) CSR vector
    n = len(in_vect.data)

    if top_x >= n:
        # Fewer stored values than requested: return all of them
        if inplace:
            out_data = in_vect.data.tolist()
            # Empty the vector by replacing its three CSR arrays
            in_vect.data = np.array([], dtype=in_vect.data.dtype)
            in_vect.indices = np.array([], dtype=in_vect.indices.dtype)
            in_vect.indptr = np.array([0, 0], dtype=in_vect.indptr.dtype)
            return out_data
        else:
            return in_vect.data.tolist()

    else:
        # Positions k..n-1 of partition_idx point at the top_x largest values
        k = n - top_x
        partition_idx = np.argpartition(in_vect.data, k)

        if inplace:
            out_data = in_vect.data[partition_idx[k:n]].tolist()
            # Keep only the remaining k values and their column indices
            in_vect.data = in_vect.data[partition_idx[0:k]]
            in_vect.indices = in_vect.indices[partition_idx[0:k]]
            in_vect.indptr = np.array([0, len(in_vect.data)], dtype=in_vect.indptr.dtype)
            return out_data
        else:
            # Fancy indexing returns a new array, so in_vect is not modified
            return in_vect.data[partition_idx[k:n]].tolist()

If you need the returned values sorted, you can of course do that as well.
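As a rough illustration of the selection step and the optional sort, here is a minimal sketch on a plain NumPy array. The array mirrors the stored values of the example vector; the names top, rest, and top_sorted are only illustrative and are not part of the function above.

```python
import numpy as np

# Stored values of the example vector, in column order (columns 2, 4, 7, 8, 12)
data = np.array([3, 1, 5, 3, 2])
top_x = 3
k = len(data) - top_x

# After argpartition, positions k..n-1 index the top_x largest values
# (in no guaranteed order) and positions 0..k-1 index the remaining ones.
idx = np.argpartition(data, k)
top = data[idx[k:]]
rest = data[idx[:k]]

# Sorting only the small result keeps the extra cost low
top_sorted = np.sort(top)[::-1].tolist()
print(top_sorted)             # [5, 3, 3]
print(sorted(rest.tolist()))  # [1, 2]
```

Because only the top_x selected values are sorted, this stays cheaper than sorting the whole data array when top_x is small.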

a=top_x_in_sparse(sparse_test,3,inplace=False)

>>> print(a)
[3, 5, 3]

>>> print(sparse_test)
  (0, 2)    3
  (0, 4)    1
  (0, 7)    5
  (0, 8)    3
  (0, 12)   2

b=top_x_in_sparse(sparse_test,3,inplace=True)

>>> print(b)
[3, 5, 3]

>>> print(sparse_test)
  (0, 4)    1
  (0, 12)   2

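Also, regarding the question in the comments: a shallow copy of a sparse array object does not copy the numpy arrays that hold the data. The sparse object only holds references to those arrays. A deepcopy would copy them, but the object's own built-in .copy() method already knows which referenced things need to be copied and which don't.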
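The difference can be checked directly. The sketch below assumes the usual csr_matrix layout, where data, indices, and indptr are ordinary attributes, so that copy.copy() produces a new wrapper around the same arrays; the variable names are only illustrative.

```python
import copy
import numpy as np
from scipy.sparse import csr_matrix

data = [1, 3, 3, 2, 5]
indices = [4, 2, 8, 12, 7]
m = csr_matrix((data, ([0] * len(data), indices)), shape=(1, 20))

shallow = copy.copy(m)   # new wrapper object, expected to reference the same arrays
independent = m.copy()   # SciPy's own copy duplicates the underlying arrays

shares_shallow = np.shares_memory(shallow.data, m.data)
shares_independent = np.shares_memory(independent.data, m.data)
print(shares_shallow, shares_independent)

# Mutating the independent copy leaves the original vector untouched
independent[0, 7] = 0
independent.eliminate_zeros()
print(m.nnz, independent.nnz)
```

In the question's function, this suggests that replacing copy.copy(in_vect) with in_vect.copy() should be enough to keep the input from being overwritten when inplace is False.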

Python - writing an in-place option in my function [how to prevent overwriting my input vector]

copy

in-place

scipy

sparse-matrix