Should I dynamically allocate memory for a numpy array that could be created at runtime in Cython?
This runs smoothly and fast:
solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
cdef bytes line
cdef str decoded_line
cdef int counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
with gzip.open(file_in, "rb") as f:
    for line in f:
        if counter%4==0: # first line of the sequence (obtain tile info)
            counter=0
        elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)): # enumerate(line.decode('utf-8')):
                sums[n, ord(decoded_line[n])] +=1
        counter+=1
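Here the numpy ndarray sums holds the result.
However, I need an unknown number of such arrays stored in a dictionary (named tiles), and this is the code that should achieve my goal: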
solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
cdef dict tiles = {} # each tile will have its own 'sums' numpy array
cdef bytes line
cdef str decoded_line
cdef str tile
cdef int counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
with gzip.open(file_in, "rb") as f:
    for line in f:
        if counter%4==0: # first line of the sequence (obtain tile info)
            decoded_line = line.decode('utf-8')
            tile = decoded_line.split(':')[4]
            if tile != tile_specific and tile not in tiles.keys(): # tile_specific is mentioned elsewhere.
                tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
            counter=0
        elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
            decoded_line = line.decode('utf-8')
            for n in range(len(decoded_line)): # enumerate(line.decode('utf-8')):
                tiles[tile][n, ord(decoded_line[n])] +=1
        counter+=1
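In the second example, I don't know a priori the number of keys in the dictionary tiles, so the numpy arrays will be declared and initialized at runtime (please correct me if I'm wrong or using the wrong terminology).
When I used the Cython declaration for the numpy array, Cython did not translate/compile, so I left it as tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32).
Since all the other Cython optimizations in the code shared between the two snippets work well, I believe this numpy array declaration is the problem.
How can I solve this? Here, the manual points out ways of dynamically allocating memory, but I don't know how that would work with numpy arrays, or whether I should do it at all.
Thanks!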
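I'd ignore the documentation about dynamically allocating memory. That isn't what you want to do: it operates very much at the C level, and you're dealing with Python objects.
You can easily reassign a variable typed as a Numpy array (or, equally, as the newer typed memoryview) multiple times, so that it refers to different Numpy arrays. I suspect what you want is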
# start of function
cdef np.ndarray[np.uint32_t, ndim=2] tile_array
# in "if counter%4==0":
if tile != tile_specific and tile not in tiles.keys(): # tile_specific is mentioned elsewhere.
    tiles[tile] = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
tile_array = tiles[tile] # not a copy! Just two references to exactly the same object
# in "if counter%3==0"
tile_array[n, ord(decoded_line[n])] +=1
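The assignment tile_array = tiles[tile] carries a small cost for some type-checking, so it's probably only worthwhile if you use tile_array a few times between each assignment (it's hard to guess exactly where the threshold is, so time it against your current version).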
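To see the rebinding idea without compiling anything, here is a minimal pure-Python sketch with NumPy. It is not the asker's Cython function: the name tally_qualities, the n_columns parameter and the in-memory list of lines are assumptions made for illustration, and the tile_specific check is left out. It only shows that a local variable rebound to tiles[tile] updates the array stored in the dictionary, because both names refer to the same object.

```python
import numpy as np


def tally_qualities(lines, length, n_columns):
    """Count quality characters per read position, one array per tile.

    `lines` is an iterable of decoded FASTQ lines (4 per record).
    Hypothetical helper for illustration only.
    """
    tiles = {}         # tile id -> 2D count array
    tile_array = None  # rebound on every header line; never a copy
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # Header line; the tile id is assumed to be the 5th ':' field,
            # as in the question's decoded_line.split(':')[4].
            tile = line.split(":")[4]
            if tile not in tiles:
                tiles[tile] = np.zeros((length, n_columns), dtype=np.uint32)
            tile_array = tiles[tile]  # second reference to the same array
        elif i % 4 == 3:
            # Quality line: one count per (position, character code).
            for n, ch in enumerate(line.rstrip("\n")):
                tile_array[n, ord(ch)] += 1
    return tiles


reads = [
    "@M1:1:FC:1:1101:10:20\n", "ACGT\n", "+\n", "II#!\n",
    "@M1:1:FC:1:1102:11:21\n", "ACGT\n", "+\n", "!!!!\n",
    "@M1:1:FC:1:1101:12:22\n", "ACGT\n", "+\n", "I#I!\n",
]
tiles = tally_qualities(reads, length=4, n_columns=74)
```

In the Cython version the same pattern applies, with tile_array declared once via cdef at the start of the function so that the indexing inside the inner loop is typed.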