比较两个不同大小的矩阵以形成一个大矩阵 - 速度改进?
Compare two different size matrices to make one large matrix - Speed Improvements?
我有两个矩阵,我需要用它们来创建更大的矩阵。每个矩阵只是一个被读取的 tab-delimited 文本文件。每个矩阵有 48 个列,每个矩阵具有相同的标识符,但行数不同。第一个矩阵是 108887x48,第二个是 55482x48。对于每个矩阵,每个位置的条目可以是 0 或 1,因此是二进制的。最终输出应该将第一个矩阵行 ID 作为行,将第二个矩阵行 ID 作为列,因此最终矩阵为 55482x10887。
这里需要做的是,对于第一个矩阵中每一行的每个pos,对于第二个矩阵中的每一行,如果每个矩阵的pos(col)为1,那么最终的矩阵计数将去up 1. 最终矩阵中任意pos的最高值可以是48,预计会剩下0。
示例:
mat1
A B C D
1id1 0 1 0 1
1id2 1 1 0 0
1id3 1 1 1 1
1id4 0 0 1 0
mat2
A B C D
2id1 1 1 0 0
2id2 0 1 1 0
2id3 1 1 1 1
2id4 1 0 1 0
final
2id1 2id2 2id3 2id4
1id1 1 1 2 0
1id2 2 1 2 1
1id3 2 2 4 2
1id4 0 1 1 1
我有执行此操作的代码,但速度慢得令人痛苦,这是我主要寻求帮助的地方。我试图尽可能地加快算法的速度。已经 运行ning 了 24 小时,只完成了大约 25%。我之前让它运行通过,最终输出文件是20GB。我没有使用数据库的经验,并且可以在这里实现它,如果有人可以在给出下面的代码片段的情况下帮助我如何做到这一点。
#!/usr/bin/env python
import sys
mat1in = sys.argv[1]
mat2in = sys.argv[2]
print '\n######################################################################################'
print 'Generating matrix by counts from smaller matrices.'
print '########################################################################################\n'
with open(mat1in, 'r') as f:
cols = [''] + next(f).strip().split('\t') # First line of matrix is composed of 48 cols
mat1 = [line.strip().split('\t') for line in f] # Each line in matrix = 'ID': 0 or 1 per col id
with open(mat2in, 'r') as f:
next(f) # Skip first row, col IDs are taken from mat1
mat2 = [line.strip().split('\t') for line in f] # Each line in matrix = 'ID': 0 or 1 per col id
out = open('final_matrix.txt', 'w') # Output file
#matrix = []
header = [] # Final matrix header
header.append('') # Add blank as first char in large matrix header
for i in mat2:
header.append(i[0]) # Composed of all mat2 row ids
#matrix.append(header)
print >> out, '\t'.join(header) # First print header to output file
print '\nTotal mat1 rows: ' + str(len(mat1)) # Get total mat1 rows
print 'Total mat2 rows: ' + str(len(mat2)), '\n' # Get total mat2 rows
print 'Progress: ' # Progress updated as each mat1 id is read
length = len(header) # Length of header, i.e. total number of mat2 ids
totmat1 = len(mat1) # Length of rows (-header), i.e. total number of mat1 ids
total = 0 # Running total - for progress meter
for h in mat1: # Loop through all mat1 ids - each row in the HC matrix
row = [] # Empty list for new row for large matrix
row.append(h[0]) # Append mat1 id, as first item in each row
for i in xrange(length-1): # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
row.extend('0')
for n in xrange(1,49): # Loop through each col id
for k in mat2: # For every row in mat2
if int(h[n]) == 1 and int(k[n]) == 1: # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
pos = header.index(k[0]) # Get the position of the mat2 id
row[pos] = str(int(row[pos]) + 1) # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
print >> out, '\t'.join(row) # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
total += 1 # Update running total
sys.stdout.write('\r\t' + str(total) + '/' + str(tvh)) # Print progress to screen
sys.stdout.flush()
print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'
以下是对 mat1 中的 id 的前 30 次迭代的概要分析:
######################################################################################
Generating matrix by counts from smaller matrices.
########################################################################################
Total mat1 rows: 108887
Total mat2 rows: 55482
Progress:
30/108887^C 2140074 function calls in 101.234 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 70.176 70.176 101.234 101.234 build_matrix.py:3(<module>)
4 0.000 0.000 0.000 0.000 {len}
55514 0.006 0.000 0.006 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1719942 1.056 0.000 1.056 0.000 {method 'extend' of 'list' objects}
30 0.000 0.000 0.000 0.000 {method 'flush' of 'file' objects}
35776 29.332 0.001 29.332 0.001 {method 'index' of 'list' objects}
31 0.037 0.001 0.037 0.001 {method 'join' of 'str' objects}
164370 0.589 0.000 0.589 0.000 {method 'split' of 'str' objects}
164370 0.033 0.000 0.033 0.000 {method 'strip' of 'str' objects}
30 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
2 0.000 0.000 0.000 0.000 {next}
3 0.004 0.001 0.004 0.001 {open}
我还为每次迭代计时,每个 mat1 id 大约需要 2.5-3 秒,如果我是正确的,则需要大约 90 小时才能完成整个过程。这就是 运行 整个脚本的全部内容。
我编辑了一些主要位,删除了通过追加和 xrange 生成行,通过将“0”乘以 headers 的长度来一步生成列表。我还用索引作为值制作了一个 mat2 id 的字典,认为 dict 查找会比索引更快..
headdict = {}
for k,v in enumerate(header):
headdict[v] = k
total = 0 # Running total - for progress meter
for h in mat1: # Loop through all mat1 ids - each row in the HC matrix
timestart = time.clock()
row = [h[0]] + ['0']*(length-1) # Empty list for new row for large matrix
#row.append(h[0]) # Append mat1 id, as first item in each row
#for i in xrange(length-1): # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
# row.append('0')
for n in xrange(1,49): # Loop through each col id
for k in mat2: # For every row in mat2
if int(h[n]) == 1 and int(k[n]) == 1: # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
pos = headdict[k[0]] #header.index(k[0]) # Get the position of the mat2 id
row[pos] = str(int(row[pos]) + 1) # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
print >> out, '\t'.join(row) # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
total += 1 # Update running total
sys.stdout.write('\r\t' + str(total) + '/' + str(totmat1)) # Print progress to screen
#sys.stdout.flush()
timeend = time.clock()
print timestart - timeend
我不太明白这段代码的作用(单字母变量名没有帮助)。
我的建议:尽量减少最内层循环的操作次数。比如内层是否需要重新计算pos
?
pos = header.index(k[0])
如果可以对嵌套循环 k
、h
和 n
重新排序,您也许可以减少成本高昂的 list.index
,这是一个 O (n) 操作。
这只是一个矩阵乘法。您想要将第一个矩阵乘以第二个矩阵的转置。对于高效的矩阵运算,得到 NumPy.
如果您将两个输入矩阵读入 dtype numpy.int8
的 NumPy 数组,那么计算很简单
m1.dot(m2.T)
最多需要几分钟。
我有两个矩阵,我需要用它们来创建更大的矩阵。每个矩阵只是一个被读取的 tab-delimited 文本文件。每个矩阵有 48 个列,每个矩阵具有相同的标识符,但行数不同。第一个矩阵是 108887x48,第二个是 55482x48。对于每个矩阵,每个位置的条目可以是 0 或 1,因此是二进制的。最终输出应该将第一个矩阵行 ID 作为行,将第二个矩阵行 ID 作为列,因此最终矩阵为 55482x10887。
这里需要做的是,对于第一个矩阵中每一行的每个pos,对于第二个矩阵中的每一行,如果每个矩阵的pos(col)为1,那么最终的矩阵计数将去up 1. 最终矩阵中任意pos的最高值可以是48,预计会剩下0。
示例:
mat1
A B C D
1id1 0 1 0 1
1id2 1 1 0 0
1id3 1 1 1 1
1id4 0 0 1 0
mat2
A B C D
2id1 1 1 0 0
2id2 0 1 1 0
2id3 1 1 1 1
2id4 1 0 1 0
final
2id1 2id2 2id3 2id4
1id1 1 1 2 0
1id2 2 1 2 1
1id3 2 2 4 2
1id4 0 1 1 1
我有执行此操作的代码,但速度慢得令人痛苦,这是我主要寻求帮助的地方。我试图尽可能地加快算法的速度。已经 运行ning 了 24 小时,只完成了大约 25%。我之前让它运行通过,最终输出文件是20GB。我没有使用数据库的经验,并且可以在这里实现它,如果有人可以在给出下面的代码片段的情况下帮助我如何做到这一点。
#!/usr/bin/env python
import sys
mat1in = sys.argv[1]
mat2in = sys.argv[2]
print '\n######################################################################################'
print 'Generating matrix by counts from smaller matrices.'
print '########################################################################################\n'
with open(mat1in, 'r') as f:
cols = [''] + next(f).strip().split('\t') # First line of matrix is composed of 48 cols
mat1 = [line.strip().split('\t') for line in f] # Each line in matrix = 'ID': 0 or 1 per col id
with open(mat2in, 'r') as f:
next(f) # Skip first row, col IDs are taken from mat1
mat2 = [line.strip().split('\t') for line in f] # Each line in matrix = 'ID': 0 or 1 per col id
out = open('final_matrix.txt', 'w') # Output file
#matrix = []
header = [] # Final matrix header
header.append('') # Add blank as first char in large matrix header
for i in mat2:
header.append(i[0]) # Composed of all mat2 row ids
#matrix.append(header)
print >> out, '\t'.join(header) # First print header to output file
print '\nTotal mat1 rows: ' + str(len(mat1)) # Get total mat1 rows
print 'Total mat2 rows: ' + str(len(mat2)), '\n' # Get total mat2 rows
print 'Progress: ' # Progress updated as each mat1 id is read
length = len(header) # Length of header, i.e. total number of mat2 ids
totmat1 = len(mat1) # Length of rows (-header), i.e. total number of mat1 ids
total = 0 # Running total - for progress meter
for h in mat1: # Loop through all mat1 ids - each row in the HC matrix
row = [] # Empty list for new row for large matrix
row.append(h[0]) # Append mat1 id, as first item in each row
for i in xrange(length-1): # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
row.extend('0')
for n in xrange(1,49): # Loop through each col id
for k in mat2: # For every row in mat2
if int(h[n]) == 1 and int(k[n]) == 1: # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
pos = header.index(k[0]) # Get the position of the mat2 id
row[pos] = str(int(row[pos]) + 1) # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
print >> out, '\t'.join(row) # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
total += 1 # Update running total
sys.stdout.write('\r\t' + str(total) + '/' + str(tvh)) # Print progress to screen
sys.stdout.flush()
print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'
以下是对 mat1 中的 id 的前 30 次迭代的概要分析:
######################################################################################
Generating matrix by counts from smaller matrices.
########################################################################################
Total mat1 rows: 108887
Total mat2 rows: 55482
Progress:
30/108887^C 2140074 function calls in 101.234 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 70.176 70.176 101.234 101.234 build_matrix.py:3(<module>)
4 0.000 0.000 0.000 0.000 {len}
55514 0.006 0.000 0.006 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1719942 1.056 0.000 1.056 0.000 {method 'extend' of 'list' objects}
30 0.000 0.000 0.000 0.000 {method 'flush' of 'file' objects}
35776 29.332 0.001 29.332 0.001 {method 'index' of 'list' objects}
31 0.037 0.001 0.037 0.001 {method 'join' of 'str' objects}
164370 0.589 0.000 0.589 0.000 {method 'split' of 'str' objects}
164370 0.033 0.000 0.033 0.000 {method 'strip' of 'str' objects}
30 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
2 0.000 0.000 0.000 0.000 {next}
3 0.004 0.001 0.004 0.001 {open}
我还为每次迭代计时,每个 mat1 id 大约需要 2.5-3 秒,如果我是正确的,则需要大约 90 小时才能完成整个过程。这就是 运行 整个脚本的全部内容。
我编辑了一些主要位,删除了通过追加和 xrange 生成行,通过将“0”乘以 headers 的长度来一步生成列表。我还用索引作为值制作了一个 mat2 id 的字典,认为 dict 查找会比索引更快..
headdict = {}
for k,v in enumerate(header):
headdict[v] = k
total = 0 # Running total - for progress meter
for h in mat1: # Loop through all mat1 ids - each row in the HC matrix
timestart = time.clock()
row = [h[0]] + ['0']*(length-1) # Empty list for new row for large matrix
#row.append(h[0]) # Append mat1 id, as first item in each row
#for i in xrange(length-1): # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
# row.append('0')
for n in xrange(1,49): # Loop through each col id
for k in mat2: # For every row in mat2
if int(h[n]) == 1 and int(k[n]) == 1: # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
pos = headdict[k[0]] #header.index(k[0]) # Get the position of the mat2 id
row[pos] = str(int(row[pos]) + 1) # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
print >> out, '\t'.join(row) # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
total += 1 # Update running total
sys.stdout.write('\r\t' + str(total) + '/' + str(totmat1)) # Print progress to screen
#sys.stdout.flush()
timeend = time.clock()
print timestart - timeend
我不太明白这段代码的作用(单字母变量名没有帮助)。
我的建议:尽量减少最内层循环的操作次数。比如内层是否需要重新计算pos
?
pos = header.index(k[0])
如果可以对嵌套循环 k
、h
和 n
重新排序,您也许可以减少成本高昂的 list.index
,这是一个 O (n) 操作。
这只是一个矩阵乘法。您想要将第一个矩阵乘以第二个矩阵的转置。对于高效的矩阵运算,得到 NumPy.
如果您将两个输入矩阵读入 dtype numpy.int8
的 NumPy 数组,那么计算很简单
m1.dot(m2.T)
最多需要几分钟。