Processing two 32 GB files with Python in Visual Studio Code: still not done after days
I'm trying to grab data from specific lines in one 32 GB file, put the extracted data into a dictionary, and then read through another 32 GB file, replacing specific lines using the keys and values from the dictionary created earlier. Finally, I'm trying to write all of this new information into a brand new file.
But when I run the program, it just keeps running; it has been going for more than 12 hours now. I implemented a progress bar, and after two hours it hadn't made even one percent of progress. I'm not getting any error messages, but I can't see any progress either. Does anyone know why? Maybe there's a problem with reading files this large? Any help would be appreciated. Here is the code I'm using.
import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

for i in tqdm(range(10000)):
    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############
    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}
    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:
            ## only perform operations if the line starts with @
            if line[0] == '@':
                ## split the line by whitespace
                readID = line.split()
                ## store the read id for each read in a variable
                readID = readID[0]
                ## store the sequence for each read in a variable
                sequence = next(R1)
                ## store the barcode (first 20 characters)
                barcode = sequence[:20]
                ## store reads as keys and barcodes as values respectively in the dictionary
                readID_dictionary[readID] = barcode
    ###########################################################
    ########### R2 #############
    ## content that will be in the new file
    new_file_content = ""
    ## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:
        ## for each line in the file
        for line in R2:
            ## if the line starts with @ perform the operations
            if line[0] == '@':
                ## split the line by whitespace
                readID = line.split()
                ## store the read ID
                readID = readID[0]
                ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                for key, value in readID_dictionary.items():
                    if readID == key:
                        readID = key + '_' + value
                ## store sequence
                sequence = next(R2)
                ## store blank (plus sign)
                blank = next(R2)
                ## store quality score
                quality = next(R2)
                ## format the content for the new file
                new_file_content += readID + '\n' + sequence + blank + quality
    ###########################################################
    ########### NEW FILE WITH UPDATED READID+BARCODE #############
    ## create a new file with the updated read ID
    writing_file = open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w")
    ## put the content in the new file
    writing_file.write(new_file_content)
    ## close the file
    writing_file.close()
    ###########################################################
    ########### REPORT #############
    ## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))
    ###########################################################
In general, I would always first try running the program on a much smaller input (1 MB, 10 MB, 100 MB?) to see whether it works at all and, if so, how long it takes per MB. From that I can estimate roughly how long the whole file should take, and how much progress to expect by any given point.
You could even do those small-file tests while leaving it running on the big file, at least to see that the program actually works and will eventually finish (without losing your current progress). Start with a really small file (maybe the first 1 MB of the big one) and then increase the size if that works.
Looking at the actual program, though, I would definitely not collect all the data in memory and only write it out at the end. I would write to the output file continuously. That is more efficient, and it avoids the potentially huge amount of virtual memory that your current program is likely using.
So, something like this (untested, since I can't test it):
import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

for i in tqdm(range(10000)):
    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############
    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}
    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:
            ## only perform operations if the line starts with @
            if line[0] == '@':
                ## split the line by whitespace
                readID = line.split()
                ## store the read id for each read in a variable
                readID = readID[0]
                ## store the sequence for each read in a variable
                sequence = next(R1)
                ## store the barcode (first 20 characters)
                barcode = sequence[:20]
                ## store reads as keys and barcodes as values respectively in the dictionary
                readID_dictionary[readID] = barcode
    ###########################################################
    ########### R2 #############
    ## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:
        ## create a new file with the updated read ID
        with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as writing_file:
            ## for each line in the file
            for line in R2:
                ## if the line starts with @ perform the operations
                if line[0] == '@':
                    ## split the line by whitespace
                    readID = line.split()
                    ## store the read ID
                    readID = readID[0]
                    ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                    for key, value in readID_dictionary.items():
                        if readID == key:
                            readID = key + '_' + value
                    ## store sequence
                    sequence = next(R2)
                    ## store blank (plus sign)
                    blank = next(R2)
                    ## store quality score
                    quality = next(R2)
                    ## format the content for the new file
                    ## and write it to the new file immediately
                    writing_file.write(readID + '\n' + sequence + blank + quality)
    ###########################################################
    ########### REPORT #############
    ## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))
    ###########################################################
Use the dictionary as a dictionary rather than as a list: checking readID in readID_dictionary is a single O(1) hash lookup, whereas looping over .items() to find a match scans the entire dictionary for every single read.
Don't keep the new file's content in memory: write it to disk as you process.
import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

for i in tqdm(range(10000)):
    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############
    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}
    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:
            ## only perform operations if the line starts with @
            if line[0] != '@': continue
            readID = line.split()[0]
            ## store the barcode (first 20 characters of next line)
            readID_dictionary[readID] = next(R1)[:20]
    ###########################################################
    ########### R2 #############
    ## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:
        with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as newfile:
            for line in R2:
                if line[0] != '@': continue
                readID = line.split()[0]
                ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                if readID in readID_dictionary:
                    readID = readID + '_' + readID_dictionary[readID]
                ## store sequence
                sequence = next(R2)
                ## store blank (plus sign)
                blank = next(R2)
                ## store quality score
                quality = next(R2)
                ## write the record for the new file
                newfile.write(readID + '\n')
                newfile.write(sequence + blank + quality)
    ###########################################################
    ########### REPORT #############
    ## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))
    ###########################################################
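The crucial change in that answer is the lookup. Stripped down to a toy comparison (the helper names are made up purely to illustrate the point, not part of either script):

```python
## What the original code does for every R2 read: scan every
## dictionary entry until one key happens to match -- O(n) per read.
def lookup_scan(d, key):
    for k, v in d.items():
        if key == k:
            return key + '_' + v
    return key

## What the answer does instead: one hash lookup -- O(1) per read.
def lookup_hash(d, key):
    if key in d:
        return key + '_' + d[key]
    return key
```

Both return the same strings, but with hundreds of millions of reads on each side, the scan version does a dictionary-sized pass per read while the hash version does constant work, which by itself can be the difference between a job that never finishes and one bounded by disk speed.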