Processing two 32GB files with Python in Visual Studio Code: no progress after days

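I am trying to pull data from specific lines of a 32GB file, put the extracted data into a dictionary, and then read through another 32GB file, replacing specific lines using the keys and values from that dictionary. Finally, I want to write all of this new information to a brand new file.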

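However, when I run the program it is still running after more than 12 hours. I implemented a progress bar, and after 2 hours it has not even reached one percent. I get no error message, but I see no progress either. Does anyone know why? Maybe there is a problem with reading files this large? Any help would be greatly appreciated. Here is the code I am using.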

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

from tqdm import tqdm
for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:

            ## only perform operations if the line starts with @
            if line[0] == '@':

                ## split the lines by whitespace
                readID = line.split()

                ## store the read id for each read in a variable
                readID = readID[0]

                ## store the sequence for each read in a variable
                sequence = next(R1)

                ## store the barcode (first 20 characters)
                barcode = sequence[:20]

                ## store read IDs as keys and barcodes as values, respectively, in the dictionary
                readID_dictionary[readID] = barcode

###########################################################

########### R2 #############

    ## content that will be in new file
    new_file_content = ""

## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:

    ## for each line in the file
        for line in R2:

        ## if the line starts with @ perform the operations
            if line[0] == '@':

            ## split the lines by whitespace
                readID = line.split()

            ## store the read ID 
                readID = readID[0]

            ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                for key, value in readID_dictionary.items():
                    if readID == key:
                        readID = key + '_' + value

            ## store sequence 
                sequence = next(R2)

            ## store blank (plus sign)
                blank = next(R2)

            ## store quality score
                quality = next(R2)

            ## format the content for the new file
                new_file_content += readID +'\n' + sequence + blank + quality 

###########################################################

########### NEW FILE WITH UPDATED READID+BARCODE #############

## create a new file with the updated read ID
    writing_file = open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w")

## put the content in the new file
    writing_file.write(new_file_content)

## close the file
    writing_file.close()

###########################################################

###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################

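In general, I always start by running the program on a much smaller input (1 MB, 10 MB, 100 MB?) to see whether it works properly and, if it does, how long it takes per MB. From that I can work out roughly how long the whole file should take, and how much progress to expect at any given point in the run.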

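You could even run those small-file tests while leaving the large-file run going, so that you at least see that the program does work and would eventually finish (without losing your current progress). Start with a very small file (perhaps the first 1 MB of the large file), and increase the size if that works well.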
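The small-input test described above could be set up roughly as follows. This is only a sketch: `write_sample` and `estimate_total_seconds` are hypothetical helper names, and it assumes the FASTQ files use exactly four lines per record.

```python
from itertools import islice


def write_sample(src_path, dst_path, n_records, lines_per_record=4):
    """Copy the first n_records records of src_path into dst_path.

    Assumes each FASTQ record is exactly lines_per_record lines long.
    """
    with open(src_path, "rt") as src, open(dst_path, "wt") as dst:
        # islice stops after the requested number of lines, so the
        # large source file is never read in full.
        dst.writelines(islice(src, n_records * lines_per_record))


def estimate_total_seconds(sample_seconds, sample_bytes, full_bytes):
    """Scale the sample's run time linearly to the full file size."""
    return sample_seconds * full_bytes / sample_bytes
```

With made-up numbers, a 1 MB sample that takes 2 seconds gives `estimate_total_seconds(2.0, 1_000_000, 32_000_000_000)`, which is 64,000 seconds, or close to 18 hours. The estimate is only meaningful if run time grows linearly with input size. Timing samples of 1 MB, 10 MB and 100 MB is a quick way to check: if ten times the input takes far more than ten times as long, something in the program (such as a loop over the whole dictionary for every read) is scaling worse than linearly.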

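Looking at the actual program, though, I definitely would not collect all the data in memory and only write it out at the end. I would write to the output file continuously. That is more efficient, and it avoids the large amount of virtual memory your current program is likely to use.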

So, something like this (untested, since I have no way to run it):

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

from tqdm import tqdm
for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:

            ## only perform operations if the line starts with @
            if line[0] == '@':

                ## split the lines by whitespace
                readID = line.split()

                ## store the read id for each read in a variable
                readID = readID[0]

                ## store the sequence for each read in a variable
                sequence = next(R1)

                ## store the barcode (first 20 characters)
                barcode = sequence[:20]

                ## store read IDs as keys and barcodes as values, respectively, in the dictionary
                readID_dictionary[readID] = barcode

###########################################################

########### R2 #############

    ## the output is written line by line below, so no in-memory buffer is needed

    ## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:

        ## create a new file with the updated read ID
        with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as writing_file:


        ## for each line in the file
            for line in R2:

            ## if the line starts with @ perform the operations
                if line[0] == '@':

                ## split the lines by whitespace
                    readID = line.split()

                ## store the read ID 
                    readID = readID[0]

                ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                    for key, value in readID_dictionary.items():
                        if readID == key:
                            readID = key + '_' + value

                ## store sequence 
                    sequence = next(R2)

                ## store blank (plus sign)
                    blank = next(R2)

                ## store quality score
                    quality = next(R2)

                ## format the content for the new file
                ## and put the content in the new file
                    writing_file.write(readID +'\n' + sequence + blank + quality)

###########################################################


###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################

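Use the dictionary as a dictionary rather than as a list: look each read ID up directly by key instead of looping over every item.

Don't keep the new file's content in memory: write it to disk as you process.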

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:
            ## only perform operations if the line starts with @
            if line[0] != '@': continue
            readID = line.split()[0]
            ## store the barcode (first 20 characters of next line)
            readID_dictionary[readID] = next(R1)[:20]

###########################################################

########### R2 #############

## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:
        with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as newfile:
            for line in R2:
                if line[0] != '@': continue
                readID = line.split()[0]
                ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                if readID in readID_dictionary:
                    readID = readID + '_' + readID_dictionary[readID]
                ## store sequence 
                sequence = next(R2)
                ## store blank (plus sign)
                blank = next(R2)
                ## store quality score
                quality = next(R2)
                ## format the content for the new file
                newfile.write(readID +'\n')
                newfile.write(sequence + blank + quality)

###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################
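One more thing seems worth pointing out. As far as I can tell, all of the scripts above still wrap the whole job in `for i in tqdm(range(10000)):`, so the two files would be processed 10,000 times over, and the bar only advances after each complete pass. That alone would explain a bar that never reaches one percent. Below is a minimal sketch that drops that loop and reports progress based on how much of R2 has been read. The function names (`read_records`, `build_barcodes`, `tag_reads`) and the `report_every` parameter are my own, not from the original post, and it assumes strict four-line FASTQ records in plain ASCII.

```python
import os
import sys


def read_records(handle):
    """Yield (header, sequence, plus, quality) tuples, four lines at a time."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines
            continue
        yield header, handle.readline(), handle.readline(), handle.readline()


def build_barcodes(r1_path, barcode_len=20):
    """Map each read ID in R1 to the first barcode_len bases of its sequence."""
    barcodes = {}
    with open(r1_path, "rt") as r1:
        for header, sequence, _plus, _quality in read_records(r1):
            barcodes[header.split()[0]] = sequence[:barcode_len]
    return barcodes


def tag_reads(r2_path, out_path, barcodes, report_every=1_000_000):
    """Stream R2 into out_path, appending '_<barcode>' to known read IDs."""
    total = os.path.getsize(r2_path)
    done = 0  # characters read so far; equals bytes for ASCII with '\n' endings
    with open(r2_path, "rt") as r2, open(out_path, "wt") as out:
        records = enumerate(read_records(r2), 1)
        for count, (header, sequence, plus, quality) in records:
            read_id = header.split()[0]
            if read_id in barcodes:  # direct key lookup, no scan of the dict
                read_id = read_id + "_" + barcodes[read_id]
            out.write(read_id + "\n" + sequence + plus + quality)
            done += len(header) + len(sequence) + len(plus) + len(quality)
            if count % report_every == 0:
                pct = 100 * done / total
                print(f"{pct:.1f}% of R2 processed", file=sys.stderr)
```

It would be called once, along the lines of `tag_reads(r2_path, out_path, build_barcodes(r1_path))`. Reading in groups of four lines, rather than testing every line for a leading `@`, also sidesteps a FASTQ quirk: `@` is a valid quality-score character, so a quality line can begin with it and be mistaken for a header. Keep in mind that a dictionary holding every read ID from a 32GB file may itself need a great deal of memory.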