Parse a big file for matching values using Python
I have two files:
- fileA has 20,000 lines
- fileB has 16,000,000 lines
I want to compare field [3] of each line in fileA against field [1] of each line in fileB.
fileA format:
1 i713426 0 726912 0 0
1 i713449 0 830731 0 0
1 i707010 0 1183442 0 A
1 i713034 0 1225231 0 G
1 i703639 0 1267327 I D
1 i713057 0 1425512 0 T
1 i713129 0 1501061 0 G
1 i707027 0 1542721 0 C
1 i713163 0 1680617 0 C
1 i707055 0 1884055 0 C
1 i713254 0 2145254 0 C
1 i713324 0 2486696 0 C
1 i6059967 0 2526746 G A
1 i713334 0 2626131 0 0
1 i713335 0 2692373 0 C
1 i713341 0 3043138 0 A
1 i707150 0 3216645 0 0
1 i713347 0 3277176 0 G
fileB format:
chr1 87190 rs1524602 A/G 0.4358974358974359 0.8
chr1 87204 rs866881507 A/G 0.02564102564102564 0.2
chr1 87234 rs533355948 C/T 0.02564102564102564 0.2
chr1 87236 rs879825293 C/T 0.05128205128205128 0.2
chr1 87256 rs373216495 C/T 0.05128205128205128 0.6
chr1 87259 rs570089526 A/G 0.05128205128205128 0.6
chr1 87302 rs529420236 C/T 0.02564102564102564 0.2
chr1 87303 rs2103135 A/G 0.1282051282051282 0.4
chr1 87304 rs550004764 A/G 0.02564102564102564 0.2
chr1 87351 rs549570359 C/T 0.02564102564102564 0.2
chr1 87360 rs180907504 C/T 0.15384615384615385 0.6
chr1 87361 rs535266627 A/G 0.02564102564102564 0.4
chr1 87366 rs558417557 A/G 0.02564102564102564 0.4
chr1 87373 rs963638476 A/G 0.02564102564102564 0.2
chr1 87374 rs974579646 A/C 0.02564102564102564 0.2
If field [3] of a fileA line equals field [1] of a fileB line, print:
i713426 rs567161598
i713449 rs547376081
i707010 rs566056983
i713034 rs568184696
i703639 rs748522325
i713057 rs528436382
i713129 rs560208264
i707027 rs532649680
i713163 rs577119367
i707055 rs566696367
i713254 rs554477909
i713324 rs542280290
My code:
with open('/////fileA','r') as bim:
    with open ('////output.isec', 'w') as ic:
        for k in bim:
            l1 = k.split('\t')
            size = len(str(l1[3]))
            with open('/fileB', 'r') as file:
                for m in file:
                    l2 = m.split('\t')
                    if len(l2[1]) != size:
                        continue
                    if l1[3] == l2[1]:
                        if l1[1] != l2[2]:
                            #print(l1[1],l2[2])
                            ic.write('{0}\t{1}\n'.format(l1[1],l2[2]))
                            break
For 900 lines of fileA, the script (O(n^2)) takes about 30 minutes. How can I modify my script to reduce the run time?
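Since both files are sorted, you only need to go through each file once, always reading the next line from whichever file currently has the lower position. This should take only a few minutes at most: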
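# Merge-style scan: both files must be sorted by position in numeric order.
with open('/////fileA', 'r') as bim, \
        open('/fileB', 'r') as file, \
        open('////output.isec', 'w') as ic:
    bim_line = bim.readline()
    line = file.readline()
    while line and bim_line:
        bim_split = bim_line.rstrip('\n').split('\t')
        split = line.rstrip('\n').split('\t')
        # Compare positions as integers; as strings, '830731' sorts after '1183442'.
        bim_pos = int(bim_split[3])
        pos = int(split[1])
        if bim_pos < pos:
            # fileA is behind: advance fileA
            bim_line = bim.readline()
        elif pos < bim_pos:
            # fileB is behind: advance fileB
            line = file.readline()
        else:
            # Same position in both files: write the pair and advance both
            ic.write(bim_split[1] + '\t' + split[2] + '\n')
            line = file.readline()
            bim_line = bim.readline()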
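If the files are not guaranteed to be sorted, one alternative is a dictionary lookup instead of the merge scan above. This is only a minimal sketch, not the approach given in the answer: the function name `match_positions` and the demo rows are hypothetical, and it assumes tab-separated fields, a single chromosome (as in the samples), and that the 20,000 lines of fileA fit comfortably in memory.

```python
import io

def match_positions(file_a, file_b, out):
    """Write 'fileA_id<TAB>fileB_id' for every position present in both inputs."""
    # Map position -> fileA identifier (fileA is the small file).
    ids_by_position = {}
    for row in file_a:
        fields = row.rstrip('\n').split('\t')
        ids_by_position[fields[3]] = fields[1]

    # Stream fileB once; each dictionary lookup takes constant time on average.
    matches = 0
    for row in file_b:
        fields = row.rstrip('\n').split('\t')
        a_id = ids_by_position.get(fields[1])
        if a_id is not None:
            out.write(a_id + '\t' + fields[2] + '\n')
            matches += 1
    return matches


# Tiny in-memory demonstration with made-up rows (not taken from the real files).
demo_a = io.StringIO('1\tidA\t0\t100\t0\tA\n1\tidB\t0\t250\t0\tC\n')
demo_b = io.StringIO('chr1\t90\trsX\tA/G\t0.1\t0.2\nchr1\t250\trsY\tC/T\t0.3\t0.4\n')
demo_out = io.StringIO()
match_count = match_positions(demo_a, demo_b, demo_out)
```

With the real data, the three arguments would be the file objects returned by `open()` for fileA, fileB, and the output file. Note that the output follows the order of fileB rather than fileA, and that every fileB line sharing a position with fileA produces a line, so duplicate positions in fileB yield duplicate matches.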