Python: a way to ignore/account for newlines with read()

Question

I'm having trouble extracting text from large (>GB) text files. The files are structured as follows:

>header1
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosition_80
andEnds
>header2
hereComesTextWithNewlineAtPosition_80
hereComesTextWithNewlineAtPosAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAlineAtPosition_80
MaybeAnotherTargetBBBBBBBBBBBrestText
andEndsSomewhereHere

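The information I am given is that, in the entry with header2, I need to extract the text from position X to position Y (the A's in this example), counting the first letter of the line below the header as position 1.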

But: these positions do not account for newline characters. So when it says 1 to 95, it really means letters 1 to 80 plus the first 15 letters of the next line.

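My first solution was to use file.read(X-1) to skip the unwanted part in front and then file.read(Y-X) to get the part I want, but when the range stretches across one or more newlines, I get too few characters.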

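Is there a way to solve this with a Python function other than read()? I thought about replacing all newlines with empty strings, but the files can be very large (millions of lines).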
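One way to avoid replacing every newline up front is to stream the file line by line and keep only the letters that fall inside the requested range. The following is a minimal sketch, not a definitive implementation: the function name extract_range and the sample data are made up for illustration, positions are assumed to be 1-based and inclusive as described above, and a header is assumed to match on its first whitespace-delimited word.

```python
import io


def extract_range(handle, header, start, end):
    """Return letters start..end (1-based, inclusive) of the record whose
    header's first word equals `header`, ignoring line breaks."""
    pieces = []
    seen = 0           # letters of the target record consumed so far
    in_target = False
    for line in handle:
        if line.startswith(">"):
            if in_target:
                break  # reached the next record, nothing more to collect
            fields = line[1:].split()
            in_target = bool(fields) and fields[0] == header
            continue
        if not in_target:
            continue
        text = line.rstrip("\r\n")
        # Part of this line that overlaps the requested range.
        lo = max(start - 1 - seen, 0)
        hi = min(end - seen, len(text))
        if lo < hi:
            pieces.append(text[lo:hi])
        seen += len(text)
        if seen >= end:
            break      # range fully collected, stop reading
    return "".join(pieces)


# Hypothetical sample mirroring the structure in the question (37-letter lines).
HEADER2_LINES = [
    "hereComesTextWithNewlineAtPosition_80",
    "hereComesTextWithNewlineAtPos" + "A" * 8,
    "A" * 37,
    "A" * 20 + "lineAtPosition_80",
    "MaybeAnotherTargetBBBBBBBBBBBrestText",
    "andEndsSomewhereHere",
]
SAMPLE = (
    ">header1\nhereComesTextWithNewlineAtPosition_80\nandEnds\n>header2\n"
    + "\n".join(HEADER2_LINES)
    + "\n"
)

region = extract_range(io.StringIO(SAMPLE), "header2", 67, 131)
print(region)
```

Only one line is held in memory at a time, and the line width does not have to be constant. With a real file, the handle would come from open(path) instead of io.StringIO.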

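I also tried to account for the newlines by adding extractLength // 80 to the length, but this is problematic in cases such as 95 characters split 2-80-13 over 3 lines: I actually need 2 additional positions, but 95 // 80 is 1.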
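The length-based correction fails because the number of newlines crossed depends on where the range starts, not only on how long it is. Assuming every sequence line (except possibly the last) has the same width and ends in a single "\n", the 1-based letter p sits at offset (p-1) + (p-1)//width within the record body, so the newlines between X and Y number (Y-1)//width - (X-1)//width. The sketch below uses hypothetical helper names (body_offset, read_range) and assumes the byte offset of the record body (body_start) is already known.

```python
import io
import string


def body_offset(pos, width, newline_len=1):
    """0-based offset, within the record body, of the 1-based letter `pos`."""
    index = pos - 1
    return index + (index // width) * newline_len


def read_range(handle, body_start, start, end, width, newline_len=1):
    """Read letters start..end (1-based, inclusive) from a binary handle."""
    first = body_offset(start, width, newline_len)
    last = body_offset(end, width, newline_len)
    handle.seek(body_start + first)
    raw = handle.read(last - first + 1)
    # Drop the line breaks that fell inside the span.
    return raw.replace(b"\r", b"").replace(b"\n", b"")


# Made-up record: 200 letters wrapped at 80 columns.
letters = (string.ascii_uppercase * 8)[:200].encode()
header = b">header2\n"
body = b"".join(letters[i:i + 80] + b"\n" for i in range(0, len(letters), 80))
data = header + body

# Letters 79..173 are 95 letters split 2 + 80 + 13 over three lines.
start, end, width = 79, 173, 80
newlines_crossed = (end - 1) // width - (start - 1) // width
piece = read_range(io.BytesIO(data), len(header), start, end, width)
print(newlines_crossed, len(piece))
```

The file is opened in binary mode here so that seek() and read() count bytes. For files with "\r\n" line endings, newline_len would be 2 under the same assumptions.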

Update:

I modified my code to use Biopython:

for s in SeqIO.parse(sys.argv[2], "fasta"): 
        #foundClusters stores the information for substrings I want extracted
        currentCluster = foundClusters.get(s.id)

        if(currentCluster is not None):

            for i in range(len(currentCluster)):

                outputFile.write(">"+s.id+"|cluster"+str(i)+"\n")

                flanking = 25

                start = currentCluster[i][0]
                end = currentCluster[i][1]
                left = currentCluster[i][2]

                if(start - flanking < 0):
                    start = 0
                else:
                    start = start - flanking

                if(end + flanking > end + left):
                    end = end + left
                else:
                    end = end + flanking

                #for debugging only
                print(currentCluster)
                print(start)
                print(end)

                outputFile.write(s.seq[start, end+1])

But I get the following error:

[[1, 55, 2782]]
0
80
Traceback (most recent call last):
  File "findClaClusters.py", line 92, in <module>
    outputFile.write(s.seq[start, end+1])
  File "/usr/local/lib/python3.4/dist-packages/Bio/Seq.py", line 236, in __getitem__
   return Seq(self._data[index], self.alphabet)
TypeError: string indices must be integers

Update 2:

Changed outputFile.write(s.seq[start, end+1]) to:

outRecord = SeqRecord(s.seq[start: end+1], id=s.id+"|cluster"+str(i), description="Repeat-Cluster")
SeqIO.write(outRecord, outputFile, "fasta")

and it works :)
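The TypeError comes from the comma: s.seq[start, end+1] passes the tuple (start, end+1) as a single index, which the Biopython version in the traceback hands straight to the underlying string. A plain str behaves the same way, as this small illustration with made-up values shows; a colon produces the intended slice.

```python
text = "ABCDEFGHIJ"
start, end = 2, 5

error = None
try:
    text[start, end + 1]       # comma: the index is the tuple (2, 6)
except TypeError as exc:
    error = exc
    print("tuple index failed:", exc)

piece = text[start:end + 1]    # colon: a slice from start up to and including end
print(piece)
```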

Answer 1

With Biopython:

from Bio import SeqIO

# X and Y are 0-based, inclusive indices into the sequence (newlines are not counted)
X = 66
Y = 130
for s in SeqIO.parse("test.fst", "fasta"):
    if "header2" == s.id:
        print(s.seq[X:Y + 1])
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

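Biopython lets you easily parse FASTA files and access their IDs, descriptions, and sequences. You then have a Seq object that you can manipulate conveniently without recoding everything (reverse complement and so on).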
