Processing two 32GB files with Python in Visual Studio Code: not finished after days

Question

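I am trying to pull data from specific lines of a 32GB file, put the extracted data into a dictionary, then read in another 32GB file and replace specific lines using the keys and values from that dictionary. Finally, I want to write all of this new information to a brand new file.

However, the program has now been running for more than 12 hours and is still going. I added a progress bar, and after 2 hours it has not reached even one percent. I get no error messages, but I see no progress either. Does anyone know why? Is there perhaps a problem with reading files this large? Any help would be greatly appreciated. Here is the code I am using.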

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

from tqdm import tqdm
for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:

            ## only perform operations if the line starts with @
            if line[0] == '@':

                ## split the lines by whitespace
                readID = line.split()

                ## store the read id for each read in a variable
                readID = readID[0]

                ## store the sequence for each read in a variable
                sequence = next(R1)

                ## store the barcode (first 20 characters)
                barcode = sequence[:20]

                ## store read IDs as keys and barcodes as values, respectively, in the dictionary
                readID_dictionary[readID] = barcode

###########################################################

########### R2 #############

    ## content that will be in new file
    new_file_content = ""

## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:

    ## for each line in the file
        for line in R2:

        ## if the line starts with @ perform the operations
            if line[0] == '@':

            ## split the lines by whitespace
                readID = line.split()

            ## store the read ID 
                readID = readID[0]

            ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                for key, value in readID_dictionary.items():
                    if readID == key:
                        readID = key + '_' + value

            ## store sequence 
                sequence = next(R2)

            ## store blank (plus sign)
                blank = next(R2)

            ## store quality score
                quality = next(R2)

            ## format the content for the new file
                new_file_content += readID +'\n' + sequence + blank + quality 

###########################################################

########### NEW FILE WITH UPDATED READID+BARCODE #############

## create a new file with the updated read ID
    writing_file = open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w")

## put the content in the new file
    writing_file.write(new_file_content)

## close the file
    writing_file.close()

###########################################################

###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################

Answer 1

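In general, I always start by running the program on a much smaller input (1 MB, 10 MB, 100 MB?) to see whether it works correctly and, if so, how long it takes per MB. From that I can work out roughly how long the whole file should take, and how much progress to expect at any given point in the run.

You could even run those small-file tests while leaving the big run going, so that you at least see that the program really works and will eventually finish (without losing your current progress). Start with a very small file (perhaps the first 1 MB of the big file), then increase the size if that works.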
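
As a rough illustration of the small-input test described above, the sketch below copies the first few whole FASTQ records into a sample file and extrapolates a measured runtime to the full file size. The helper names `write_sample` and `estimate_total_seconds`, the 4-lines-per-record layout, and the default sample size are assumptions made for this example; they are not part of the original program.

```python
def write_sample(src_path, dst_path, max_bytes=1_000_000, lines_per_record=4):
    """Copy whole records from src_path to dst_path until max_bytes is reached.

    Assumes every record is exactly lines_per_record lines long
    (4 for a plain FASTQ file). Returns the number of bytes written.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while written < max_bytes:
            record = [src.readline() for _ in range(lines_per_record)]
            if not record[-1]:  # ran out of lines: end of file
                break
            dst.writelines(record)
            written += sum(len(line) for line in record)
    return written


def estimate_total_seconds(sample_seconds, sample_bytes, full_bytes):
    """Linearly extrapolate a sample timing to the full file size."""
    return sample_seconds * full_bytes / sample_bytes
```

Timing the real script on samples of, say, 1 MB and then 10 MB and comparing the seconds per MB is informative in itself: if the per-MB figure grows with the sample size, the program does not scale linearly and a linear extrapolation will be too optimistic.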

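Looking at the actual program, though, I would definitely not collect all the data in memory and write it out only at the end. I would write to the output file continuously. That is more efficient and avoids the large amount of virtual memory that your current program is likely to use.

So, something like this (untested, because I can't):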

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

from tqdm import tqdm
for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:

            ## only perform operations if the line starts with @
            if line[0] == '@':

                ## split the lines by whitespace
                readID = line.split()

                ## store the read id for each read in a variable
                readID = readID[0]

                ## store the sequence for each read in a variable
                sequence = next(R1)

                ## store the barcode (first 20 characters)
                barcode = sequence[:20]

                ## store read IDs as keys and barcodes as values, respectively, in the dictionary
                readID_dictionary[readID] = barcode

###########################################################

########### R2 #############


    ## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:

        ## create a new file with the updated read ID
        with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as writing_file:


        ## for each line in the file
            for line in R2:

            ## if the line starts with @ perform the operations
                if line[0] == '@':

                ## split the lines by whitespace
                    readID = line.split()

                ## store the read ID 
                    readID = readID[0]

                ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                    for key, value in readID_dictionary.items():
                        if readID == key:
                            readID = key + '_' + value

                ## store sequence 
                    sequence = next(R2)

                ## store blank (plus sign)
                    blank = next(R2)

                ## store quality score
                    quality = next(R2)

                ## format the content for the new file
                ## and put the content in the new file
                    writing_file.write(readID +'\n' + sequence + blank + quality)

###########################################################


###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################

Answer 2

Use the dictionary as a dictionary, not as a list.

Don't keep the new file's contents in memory: write them to disk as you process.
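
To make the first point concrete, here is a small sketch contrasting the two lookup styles. The function names and the toy data are made up for illustration. Scanning `items()` for every read costs time proportional to the size of the dictionary, so with N reads in each file the total work grows like N squared, whereas a keyed lookup takes roughly constant time per read.

```python
def find_by_scanning(read_id, barcodes):
    """What the original inner loop does: compare against every entry."""
    for key, value in barcodes.items():
        if key == read_id:
            return value
    return None


def find_by_key(read_id, barcodes):
    """Direct hash lookup; returns None when the ID is absent."""
    return barcodes.get(read_id)


# Toy data (hypothetical IDs and barcodes), only to show both give the same answer.
barcodes = {"@read%d" % i: "BC%d" % i for i in range(100_000)}
assert find_by_scanning("@read99999", barcodes) == find_by_key("@read99999", barcodes)
```

Both functions return the same value; only the cost differs, and that difference is multiplied by the number of reads in the second file.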

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:
            ## only perform operations if the line starts with @
            if line[0] != '@': continue
            readID = line.split()[0]
            ## store the barcode (first 20 characters of next line)
            readID_dictionary[readID] = next(R1)[:20]

###########################################################

########### R2 #############

## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:
        with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as newfile:
            for line in R2:
                if line[0] != '@': continue
                readID = line.split()[0]
                ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                if readID in readID_dictionary:
                    readID = readID + '_' + readID_dictionary[readID]
                ## store sequence 
                sequence = next(R2)
                ## store blank (plus sign)
                blank = next(R2)
                ## store quality score
                quality = next(R2)
                ## format the content for the new file
                newfile.write(readID +'\n')
                newfile.write(sequence + blank + quality)

###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################
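
Two further details in the code above may be worth checking. First, every version still sits inside `for i in tqdm(range(10000))`, so the whole job would run 10000 times and the bar would advance by one step only after a complete pass over both files. Second, a FASTQ quality line can itself begin with `@`, so testing `line[0] == '@'` on every line of R1 can mistake a quality line for a header. The sketch below is one possible restructuring under the assumption that both files consist of well-formed 4-line records. The function names and the `on_progress` callback are invented for this example, and it is untested against the real data.

```python
def load_barcodes(r1_path, barcode_len=20):
    """Map each R1 read ID to the first barcode_len characters of its sequence.

    Reads strictly in 4-line records, so a quality line that starts
    with '@' is never mistaken for a header.
    """
    barcodes = {}
    with open(r1_path, "rt") as r1:
        while True:
            header = r1.readline()
            if not header:  # end of file
                break
            sequence = r1.readline()
            r1.readline()  # '+' separator line, ignored
            r1.readline()  # quality line, ignored
            barcodes[header.split()[0]] = sequence.rstrip("\n")[:barcode_len]
    return barcodes


def tag_reads(r2_path, out_path, barcodes, on_progress=None):
    """Stream R2 to out_path, appending '_<barcode>' to IDs found in barcodes."""
    with open(r2_path, "rt") as r2, open(out_path, "w") as out:
        while True:
            header = r2.readline()
            if not header:  # end of file
                break
            rest = r2.readline() + r2.readline() + r2.readline()
            read_id = header.split()[0]
            barcode = barcodes.get(read_id)
            if barcode is not None:
                read_id = read_id + "_" + barcode
            out.write(read_id + "\n" + rest)
            if on_progress is not None:
                # characters consumed; equals bytes for ASCII FASTQ text
                on_progress(len(header) + len(rest))


# Possible wiring with tqdm, run once rather than in a loop:
#   import os
#   from tqdm import tqdm
#   with tqdm(total=os.path.getsize(r2_path), unit="B", unit_scale=True) as bar:
#       tag_reads(r2_path, out_path, load_barcodes(r1_path), on_progress=bar.update)
```

Driving the bar from the amount of input consumed means the percentage reflects how far through R2 the program actually is. Note that the dictionary still holds one entry per R1 read, so on a 32GB input it may need a substantial amount of RAM on its own.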

python

bigdata