Should I dynamically allocate memory for a numpy array that could be created at runtime in Cython?

This runs smoothly and fast:

solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32) 

cdef bytes line
cdef str decoded_line
cdef int counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
with gzip.open(file_in, "rb") as f:
    for line in f:
    
        if counter%4==0: # first line of the sequence (obtain tile info)
            counter=0
    
        elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)): #     enumerate(line.decode('utf-8')):
                sums[n, ord(decoded_line[n])] +=1
                
        counter+=1

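Here, the numpy ndarray sums contains the results.

However, I need an unknown number of arrays stored in a dictionary (named tiles), and this is the code that should achieve my goal: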

solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"

cdef dict tiles = {} # each tile will have its own 'sums' numpy array

cdef bytes line
cdef str decoded_line
cdef str tile

cdef int counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
with gzip.open(file_in, "rb") as f:
    for line in f:

        if counter%4==0: # first line of the sequence (obtain tile info)
            decoded_line = line.decode('utf-8')
            tile = decoded_line.split(':')[4]
            if tile != tile_specific and tile not in tiles.keys(): # tile_specific is mentioned elsewhere.
                tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)

            counter=0

        elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)): #     enumerate(line.decode('utf-8')):
                tiles[tile][n, ord(decoded_line[n])] +=1
                
        counter+=1

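In the second example, I do not know a priori how many keys the dictionary tiles will have, so the numpy arrays are declared and initialized at runtime (please correct me if I am wrong or using the wrong terminology). When I used the Cython declaration for the numpy array, Cython did not translate/compile, so I left it as tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32). Since all the other Cython optimizations in the code shared between the two snippets work fine, I believe this numpy array declaration is the problem.

How can I solve this? The manual describes a way to dynamically allocate memory, but I do not know how that would work with numpy arrays, nor whether I should do it at all.

Thanks!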

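I would ignore the documentation about dynamically allocating memory. That is not what you want to do: it works very much at the C level, whereas you are dealing with Python objects.

You can easily reassign a variable typed as a Numpy array (or, equally, as the newer typed memoryview) as many times as you like, so that it refers to different Numpy arrays. I suspect what you want is something like: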

# start of function
cdef np.ndarray[np.uint32_t, ndim=2] tile_array

# in "if counter%4==0":
if tile != tile_specific and tile not in tiles.keys(): # tile_specific is mentioned elsewhere.
    tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
tile_array = tiles[tile]  # not a copy! Just two references to exactly the same object

# in "if counter%3==0"
tile_array[n, ord(decoded_line[n])] +=1

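The assignment tile_array = tiles[tile] carries a small cost for some type-checking, so it is probably only worthwhile if you use tile_array several times between each assignment (it is hard to guess exactly where the threshold lies, so time it against your current version).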
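
To see the rebinding pattern in isolation, here is a minimal pure-Python sketch (no Cython typing, so it shows the semantics rather than the speed). It makes several assumptions that are not in your code: the function name count_qualities, the n_symbols parameter, the in-memory records list, and the made-up header layout are all hypothetical. It also leaves out the tile_specific check and uses enumerate with counter % 4 == 3 to pick the quality line, which is a simplification of your counter logic.

```python
import numpy as np

def count_qualities(lines, length, n_symbols=128):
    """Count quality characters per tile and read position (illustrative only)."""
    tiles = {}
    tile_array = None  # in Cython this would be the typed variable declared once
    for counter, line in enumerate(lines):
        if counter % 4 == 0:  # header line: find or create this tile's array
            tile = line.split(':')[4]
            if tile not in tiles:
                tiles[tile] = np.zeros((length, n_symbols), dtype=np.uint32)
            tile_array = tiles[tile]  # rebinds the name; no data is copied
        elif counter % 4 == 3:  # quality line of the record
            for n, ch in enumerate(line.rstrip('\n')):
                tile_array[n, ord(ch)] += 1  # updates the array stored in tiles
    return tiles

# Hypothetical two-record input; field 4 of the header is taken as the tile.
records = [
    "@M:1:FC:1:1101:10:20\n", "ACGT\n", "+\n", "II#I\n",
    "@M:1:FC:1:1102:11:21\n", "ACGT\n", "+\n", "I###\n",
]
tiles = count_qualities(records, length=4)
print(sorted(tiles))
```

Because tile_array and tiles[tile] are two names for the same object, every increment made through tile_array is visible in the dictionary; nothing has to be written back.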