How do I count the occurrences of characters of a partition in Python?

Question

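I have a large file containing sequences. I only want to analyze the last group of characters on each line, which varies in length. For each line of the text file, I want to take the first and the last character of that group and count how many times each of those characters occurs across the whole file.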

Here is a sample of the data in the file:

-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV

-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV

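I want to take the first and the last character after "H3" (P and V in the first line, N and V in the second). The output for these two lines should be:

first Counter({'N': 1, 'P': 1})

last Counter({'V': 2})

This is what I have done so far: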

f = open("C:/CDRH3.txt", "r")
from collections import Counter
grab = 1
for line in f:
   line=line.rstrip()
   left,sep,right=line.partition(" H3 ")
   if sep:
         AminoAcidsFirst = right[:grab] 
         AminoAcidsLast = right[-grab:]
print ("first ",Counter(line[:] for line in AminoAcidsFirst))
print ("last ",Counter(line[:] for line in AminoAcidsLast))
f.close()

This only prints the counts for the last line of data, like this:

first Counter({'N': 1})
last Counter({'V': 1})

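How do I count these characters across all lines in the file? Note: printing AminoAcidsFirst or AminoAcidsLast inside the loop gives the desired characters, one per line, but I cannot count them or write them to a file. Writing to a new file only writes the characters from the last line of the original file. Thanks!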

Answer 1

Create two empty lists and append to them on each loop iteration, like this:

f = open("C:/CDRH3.txt", "r")
from collections import Counter
grab = 1
AminoAcidsFirst = []
AminoAcidsLast = []
for line in f:
   line=line.rstrip()
   left,sep,right=line.partition(" H3 ")
   if sep:
         AminoAcidsFirst.append(right[:grab])
         AminoAcidsLast.append(right[-grab:])
print ("first ",Counter(line[:] for line in AminoAcidsFirst))
print ("last ",Counter(line[:] for line in AminoAcidsLast))
f.close()

Here:

Create the empty lists:

AminoAcidsFirst = []
AminoAcidsLast = []

Append on each loop iteration:

AminoAcidsFirst.append(right[:grab])
AminoAcidsLast.append(right[-grab:])
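As a variation on the lists above, the counts can also be accumulated directly in two Counter objects, so nothing has to be stored per line. This is only a minimal sketch and not the answerer's code: the function name count_first_last, its marker parameter, and the in-memory sample lines are assumptions made for illustration.

```python
from collections import Counter

def count_first_last(lines, marker=" H3 "):
    """Count the first and last character of the text after `marker`.

    `count_first_last` and `marker` are hypothetical names for this sketch.
    """
    first, last = Counter(), Counter()
    for line in lines:
        left, sep, right = line.rstrip().partition(marker)
        if sep and right:          # skip lines that lack the marker
            first[right[0]] += 1   # update the running totals in place
            last[right[-1]] += 1
    return first, last

# In-memory stand-in for the file, including a blank line.
sample = [
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV\n",
    "\n",
    "-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV\n",
]
first, last = count_first_last(sample)
print("first", first)
print("last", last)
```

Passing an open file object instead of sample should work the same way, since a file object yields its lines one at a time when iterated.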

Answer 2

There is no need for Counter: just take the last token after splitting the line and count its first and last characters:

first_counter = {}
last_counter = {}
with open("C:/CDRH3.txt") as f:
    for line in f:
        tokens = line.split()
        if not tokens:       # skip blank lines
            continue
        token = tokens[-1]   # grab the last token
        first_counter[token[0]] = first_counter.get(token[0], 0) + 1
        last_counter[token[-1]] = last_counter.get(token[-1], 0) + 1

print("first ", first_counter)
print("last ", last_counter)

Output

first  {'P': 1, 'N': 1}
last  {'V': 2}

Answer 3

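I would like to point out two important things:

Never reveal file paths from your own computer; this applies especially if you come from the scientific community.
Use the with...as approach.

Now the program: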

from collections import Counter

filePath = "C:/CDRH3.txt"
AminoAcidsFirst, AminoAcidsLast = [], [] # important! these should be lists

with open(filePath, 'rt') as f:  # rt not r. Explicit is better than implicit
    for line in f:
        line = line.rstrip()
        left, sep, right = line.partition(" H3 ")
        if sep:
            AminoAcidsFirst.append( right[0] ) # really no need of extra grab=1 variable
            AminoAcidsLast.append( right[-1] ) # better than right[-grab:]
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))

Do not use line.strip()[-1], because checking sep is important.

Output

first  Counter({'P': 1, 'N': 1})
last  Counter({'V': 2})
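The remark about checking sep can be illustrated with a small comparison. This is a sketch under an assumption: the second record below is made up (a hypothetical non-H3 line) purely to show what happens when a line lacks the " H3 " marker. It is not taken from the asker's file.

```python
lines = [
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV",
    "-1xyz_HL_0_CDRL1.pdb kabat L1 RASQ",   # hypothetical non-H3 record
]

# Last-token approach: every non-empty line is counted, marker or not.
by_split = [line.split()[-1][0] for line in lines if line.split()]

# partition approach: only lines that contain " H3 " are counted.
by_partition = []
for line in lines:
    left, sep, right = line.rstrip().partition(" H3 ")
    if sep:
        by_partition.append(right[0])

print(by_split)      # ['P', 'R']
print(by_partition)  # ['P']
```

If every line in the file is guaranteed to be an H3 record, both approaches give the same counts; the sep check only matters when other kinds of lines may be mixed in.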

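Note: the data file may grow very large, and you could run into memory problems or the computer could hang. So may I suggest lazy reading? Below is a more robust program: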

from collections import Counter

filePath = "C:/CDRH3.txt"
AminoAcidsFirst, AminoAcidsLast = [], [] # important! these should be lists

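def chunk_read(fileObj, sizeHint = 100):
    # readlines(hint) takes an approximate size, not a number of lines:
    # it stops reading once the lines read so far exceed the hint
    while True:
        lines = fileObj.readlines(sizeHint)
        if not lines:  # empty list means end of file
            break
        yield lines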

with open(filePath, 'rt') as f:  # rt not r. Explicit is better than implicit
    for aChunk in chunk_read(f):
        for line in aChunk:
            line = line.rstrip()
            left, sep, right = line.partition(" H3 ")
            if sep:
                AminoAcidsFirst.append( right[0] ) # really no need of extra grab=1 variable
                AminoAcidsLast.append( right[-1] ) # better than right[-grab:]
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))
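If the goal is to read a fixed number of lines per chunk, one option is itertools.islice, because the argument of readlines is a size hint and not a line count. The sketch below is an assumption-laden illustration: chunk_lines and lines_count are hypothetical names, and io.StringIO stands in for the real file so the example runs anywhere. Note also that a plain `for line in f` loop already reads a file lazily, so chunking is optional.

```python
import io
from collections import Counter
from itertools import islice

def chunk_lines(file_obj, lines_count=100):
    """Yield lists of at most `lines_count` lines until the file is exhausted."""
    while True:
        chunk = list(islice(file_obj, lines_count))
        if not chunk:      # nothing left to read
            break
        yield chunk

# io.StringIO stands in for a real file object in this sketch.
fake_file = io.StringIO(
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV\n"
    "-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV\n"
)

first, last = Counter(), Counter()
chunk_sizes = []
for chunk in chunk_lines(fake_file, lines_count=1):
    chunk_sizes.append(len(chunk))
    for line in chunk:
        left, sep, right = line.rstrip().partition(" H3 ")
        if sep:
            first[right[0]] += 1   # count as we go, no lists kept
            last[right[-1]] += 1

print(chunk_sizes)
print("first", first)
print("last", last)
```

Updating the counters inside the loop means memory use depends on the number of distinct characters, not on the number of lines.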

Answer 4

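If you put statements that print AminoAcidsFirst and AminoAcidsLast at the bottom of the for loop, or after it, you will see that each iteration simply assigns a new value. Your intent should be to collect or accumulate those values before feeding them to collections.Counter.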

s = ['-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV', '-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV']

The direct fix for your code is to accumulate the characters:

import collections

grab = 1
AminoAcidsFirst = ''
AminoAcidsLast = ''
for line in s:
   line=line.rstrip()
   left,sep,right=line.partition(" H3 ")
   if sep:
         AminoAcidsFirst += right[:grab] 
         AminoAcidsLast += right[-grab:]
print ("first ",collections.Counter(AminoAcidsFirst))
print ("last ",collections.Counter(AminoAcidsLast))

Another approach is to produce the characters on demand. Define a generator function that yields the things you want to count:

def f(iterable):
    for thing in iterable:
        left, sep, right = thing.partition(' H3 ')
        if sep:
            yield right[0]
            yield right[-1]

Then feed it to collections.Counter:

z = collections.Counter(f(s))

Or use a file as the data source:

with open('myfile.txt') as f1:
    # lines is a generator expression
    # that produces stripped lines
    lines = (line.strip() for line in f1)
    z = collections.Counter(f(lines))
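Because the generator above yields first and last characters into one stream, a single Counter built from it holds combined totals. If separate counts are wanted, as in the question, one possible variation is to yield pairs instead. This is a sketch only; first_last_pairs is a hypothetical helper name, not part of the answer.

```python
from collections import Counter

s = [
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV",
    "-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV",
]

def first_last_pairs(iterable):
    """Yield (first, last) character tuples; a hypothetical helper."""
    for thing in iterable:
        left, sep, right = thing.partition(" H3 ")
        if sep and right:
            yield right[0], right[-1]

first_counter, last_counter = Counter(), Counter()
for first_char, last_char in first_last_pairs(s):
    first_counter[first_char] += 1   # first characters only
    last_counter[last_char] += 1     # last characters only

print("first", first_counter)
print("last", last_counter)
```

The generator still produces values on demand, so it can be fed stripped lines from a file in the same way as the function f.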

python

list

counter

partition