Can getline() be used multiple times within a loop? - Cython, file reading

Question

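I want to read a file four lines at a time (it is a FASTQ file containing DNA sequences).
Reading the file one or two lines at a time works fine, but when I read 3 or 4 lines at a time my code crashes (the kernel appears to have died in Jupyter Notebook). To reproduce, uncomment the last part of the loop, or any 3 of the 4 getline() calls.
I also tried storing the lines in an array of char pointers (char**), but hit the same problem.

Any idea what might be causing this?

I am using Python 3.7.3 and Cython 0.29, with all other libraries up to date. The file being read is about 1.3 GB; the machine has 8 GB of RAM and runs Ubuntu 16.04. The code is adapted from https://gist.github.com/pydemo/0b85bd5d1c017f6873422e02aeb9618a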

%%cython
from libc.stdio cimport FILE, fopen, fclose, getline
    
def fastq_reader(early_stop=10):
    cdef const char* fname = b'/path/to/file'
    cdef FILE* cfile
    cfile = fopen(fname, "rb")

    cdef:
        char * line_0 = NULL
        char * line_1 = NULL
        char * line_2 = NULL
        char * line_3 = NULL
        size_t seed = 0
        ssize_t length_line
        unsigned long long line_nb = 0

    while True:
        length_line = getline(&line_0, &seed, cfile)
        if length_line < 0: break
        
        length_line = getline(&line_1, &seed, cfile)
        if length_line < 0: break
        
#         length_line = getline(&line_2, &seed, cfile)
#         if length_line < 0: break
        
#         length_line = getline(&line_3, &seed, cfile)
#         if length_line < 0: break

        line_nb += 4
        if line_nb > early_stop:
            break

    fclose(cfile)
    return line_nb

fastq_reader(early_stop=20000)

Answer 1

The root problem was my misunderstanding of getline() (see the getline() C reference).

To store the lines in different variables, each line pointer *lineptr needs its own associated n.

If *lineptr is set to NULL and *n is set 0 before the call, then getline() will allocate a buffer for storing the line.

Alternatively, before calling getline(), *lineptr can contain a pointer to a malloc(3)-allocated buffer *n bytes in size. If the buffer is not large enough to hold the line, getline() resizes it with realloc(3), updating *lineptr and *n as necessary.
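As a minimal plain-C sketch of the rules quoted above (C rather than Cython, since getline() is the C library call being wrapped): one buffer and one size variable are enough when only one line is needed at a time. The helper name longest_line is hypothetical and only serves the illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* asks <stdio.h> to declare getline() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: returns the length (newline included) of the
   longest line in fp, reusing a single getline()-managed buffer. */
static long longest_line(FILE *fp)
{
    char *line = NULL;  /* NULL and 0: getline() allocates the buffer */
    size_t cap = 0;     /* capacity of `line`, kept up to date by getline() */
    ssize_t len;
    long longest = 0;

    while ((len = getline(&line, &cap, fp)) != -1) {
        /* getline() may have moved `line` and enlarged `cap` here */
        if (len > longest)
            longest = len;
    }
    free(line);  /* the caller owns the buffer and must release it */
    return longest;
}
```

Because `line` and `cap` are only ever passed together, the size that getline() sees always describes the buffer it is about to write into.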

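n (seed in my code) holds the size of the buffer allocated for the pointer into which getline() places the incoming line. Because I passed the same size variable for different pointers, getline() received wrong information about the size of each char* line_xxx buffer.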

A FASTQ file usually looks like this:

@read_id_usually_short
CTATACCACCAAGGCTGGAAATTGTAAAACACACCGCCTGACATATCAATAAGGTGTCAAATTCCCTTTTCTCTAGCTTTCGTACT_very_long
+
-///.)/.-/)//-//..-*...-.&%&.--%#(++*/.//////,/*//+(.///..,%&-#&)..,)/.,.._same_length_as_line_2

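With only one or two getline() calls sharing the same size variable there was no error, because the buffer was reported as too small and getline() resized it.
But with 3 or 4 getline() calls, the call length_line = getline(&line_2, &seed, cfile) is asked to store a line of length 2 ('+\n') while being told (wrongly) that the buffer behind line_2 is already large enough (the size of line_1's buffer).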
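A small plain-C sketch, under the assumption of a short line followed by a long one, makes the mismatch visible: each buffer ends up with its own capacity, so a single shared size variable cannot describe both. The helper name two_line_capacities is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* asks <stdio.h> to declare getline() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: reads the first two lines of fp into separate
   buffers and reports the capacity getline() recorded for each one.
   Returns 0 on success, -1 if either line could not be read. */
static int two_line_capacities(FILE *fp, size_t *cap_a, size_t *cap_b)
{
    char *a = NULL, *b = NULL;
    size_t na = 0, nb = 0;  /* one size variable per buffer */

    int ok = getline(&a, &na, fp) != -1 &&
             getline(&b, &nb, fp) != -1;

    *cap_a = na;
    *cap_b = nb;
    free(a);  /* free(NULL) is a no-op, so this is safe on failure too */
    free(b);
    return ok ? 0 : -1;
}
```

If the larger of the two capacities were later passed alongside the smaller buffer, getline() would be entitled to write past the end of that allocation, which is undefined behavior and a plausible source of the crash described above.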

So the (simple) solution is:

%%cython
from libc.stdio cimport FILE, fopen, fclose, getline
    
def fastq_reader(early_stop=10):
    cdef const char* fname = b'/path/to/file'
    cdef FILE* cfile
    cfile = fopen(fname, "rb")

    cdef:
        char * line_0 = NULL
        char * line_1 = NULL
        char * line_2 = NULL
        char * line_3 = NULL
        # One variable for each line pointer
        size_t n_0 = 0
        size_t n_1 = 0
        size_t n_2 = 0
        size_t n_3 = 0
        ssize_t length_line
        unsigned long long line_nb = 0

    while True:
        # Same file (same cfile) for every call, but each line_x is paired with its own n_x
        length_line = getline(&line_0, &n_0, cfile)  
        if length_line < 0: break
        
        length_line = getline(&line_1, &n_1, cfile)
        if length_line < 0: break
        
        length_line = getline(&line_2, &n_2, cfile)
        if length_line < 0: break
        
        length_line = getline(&line_3, &n_3, cfile)
        if length_line < 0: break

        line_nb += 4
        if line_nb > early_stop:
            break

    fclose(cfile)
    return line_nb

fastq_reader(early_stop=20000)
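The same pairing can be sketched in plain C with a small struct, so that a pointer and its size cannot be mixed up and the four calls collapse into a loop. The names line_buf and count_records are hypothetical, and counting records stands in for whatever per-record processing is needed. Unlike the Cython function above, this sketch also releases the buffers; the Cython equivalent would need from libc.stdlib cimport free.

```c
#define _POSIX_C_SOURCE 200809L  /* asks <stdio.h> to declare getline() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define LINES_PER_RECORD 4

/* A buffer and the capacity getline() maintains for it travel together. */
struct line_buf {
    char  *text;
    size_t cap;
};

/* Hypothetical helper: counts complete 4-line records in fp,
   stopping once max_records have been read. */
static unsigned long long count_records(FILE *fp,
                                        unsigned long long max_records)
{
    /* All pointers start as NULL and all capacities as 0. */
    struct line_buf rec[LINES_PER_RECORD] = {{NULL, 0}};
    unsigned long long n_records = 0;
    int complete = 1;

    while (complete && n_records < max_records) {
        for (int i = 0; i < LINES_PER_RECORD; i++) {
            if (getline(&rec[i].text, &rec[i].cap, fp) == -1) {
                complete = 0;  /* EOF or error in the middle of a record */
                break;
            }
        }
        if (complete)
            n_records++;
    }

    for (int i = 0; i < LINES_PER_RECORD; i++)
        free(rec[i].text);  /* buffers allocated by getline() */
    return n_records;
}
```

A trailing partial record is not counted, which mirrors the break statements in the Cython loop.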

Thank you for pointing out my mistake.
