How can I look inside of different chunks of strings in my Python program?

Question

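Beginner programmer here, with a lot still to learn. Right now I'm working with a very large text file, and I want to look at character frequencies in different chunks of the text. For example, how often do the characters "a" and "b" appear in text[0:600] vs. text[600:1200] vs. text[1200:1800], and so on? Right now I only know how to print text[0:600], but I don't know how to write the syntax that tells Python to look for "a" and "b" only within that chunk of text.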

I'm thinking the best way to write this might be something like: "for each of these chunks I have, tell me the frequency counts of 'a' and 'b'." Does that seem feasible?

Thanks so much!

Here is what I have so far, if you'd like to see it. It's very simple:

import re

# read the whole file; the with block closes it afterwards
with open('text.txt') as f:
    fa = f.read()

fa = fa.lower()
corn = re.sub(r'chr', '', fa)  # delete chromosome title
potato = re.sub(r'[^atcg]', '', corn)  # delete all other characters

print(potato[0:50])

Answer 1

You can position the file cursor and read from there:

with open('myfile.txt') as myfile:
    myfile.seek(1200)
    text = myfile.read(600)

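This reads 600 bytes starting at position 1200. Note that the position may be off when the text contains Unicode characters, since a single character can occupy more than one byte; in Python 3 text mode, read(600) also counts characters rather than bytes.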
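Building on the read() idea, here is a minimal sketch of walking through a whole file chunk by chunk instead of seeking to each offset by hand. The function name count_in_file_chunks and its parameters are made up for illustration, and the demo uses an in-memory io.StringIO so it runs without a real file; it assumes Python 3.

```python
import io

def count_in_file_chunks(fileobj, chunk_size=600, letters="ab"):
    """Yield one {letter: count} dict per chunk read from fileobj."""
    while True:
        # In text mode this returns at most chunk_size characters.
        chunk = fileobj.read(chunk_size)
        if not chunk:  # an empty string means end of file
            break
        yield {letter: chunk.count(letter) for letter in letters}

# Demo on an in-memory file; an object returned by open() should behave the same way.
demo = io.StringIO("aabbbcab")
results = list(count_in_file_chunks(demo, chunk_size=4))
print(results)
```

Because each call to read() continues where the previous one stopped, this never holds more than one chunk in memory, which may matter for a very large file.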

Answer 2

You already know how to split the text. The general case is:

interval = 600
chunks = [text[idx:idx+interval] for idx in range(0, len(text), interval)]

And to count the occurrences of a substring (here 'a') in a string:

term = 'a'
term_counts = [chunk.count(term) for chunk in chunks]
# zip them together to make it nicer (note that zip returns an iterator in Python 3)
chunks_with_counts = zip(chunks, term_counts)

Example:

>>> text = "The quick brown fox jumps over the lazy dog"
>>> interval = 3
>>> chunks = [text[idx:idx+interval] for idx in range(0, len(text), interval)]
>>> chunks
['The', ' qu', 'ick', ' br', 'own', ' fo', 'x j', 'ump', 's o', 'ver', ' th', 'e l', 'azy', ' do', 'g']
>>> term='o'
>>> term_counts = [chunk.count(term) for chunk in chunks]
>>> term_counts
[0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
>>> chunks_with_counts = zip(chunks, term_counts)
>>> list(chunks_with_counts)
[('The', 0), (' qu', 0), ('ick', 0), (' br', 0), ('own', 1), (' fo', 1), ('x j', 0), ('ump', 0), ('s o', 1), ('ver', 0), (' th', 0), ('e l', 0), ('azy', 0), (' do', 1), ('g', 0)]
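The question asks about two characters at once, so here is a rough sketch that extends the same slicing idea to several letters using collections.Counter from the standard library. The helper name chunk_letter_counts and the sample string are illustrative only, not part of the answer above.

```python
from collections import Counter

def chunk_letter_counts(text, interval, letters):
    """Return one {letter: count} dict for each interval-sized chunk of text."""
    counts = []
    for idx in range(0, len(text), interval):
        # Counter tallies every character in the chunk in a single pass.
        tally = Counter(text[idx:idx + interval])
        # A Counter returns 0 for characters it never saw.
        counts.append({letter: tally[letter] for letter in letters})
    return counts

sample = "aattccggatat"
per_chunk = chunk_letter_counts(sample, 4, "at")
print(per_chunk)
```

For only one or two letters, calling chunk.count() per letter is probably just as good; Counter starts to pay off when you want many characters from each chunk.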

Answer 3

Yes, that seems feasible. You can loop over the text in chunks:

def compare_characters(chunk):
    # check for frequency of a and b or whatever
    pass

chunksize = 600
i = 0
while i*chunksize < len(text):
    compare_characters(text[i*chunksize:(i+1)*chunksize])
    i+=1
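As one possible way to fill in the placeholder, the sketch below has compare_characters return the relative frequency of each letter in a chunk. The extra letters parameter, the choice of relative frequency over raw counts, and the sample text are all assumptions made for illustration.

```python
def compare_characters(chunk, letters="ab"):
    """Return the relative frequency of each letter within one chunk."""
    if not chunk:
        # Avoid dividing by zero on an empty chunk.
        return {letter: 0.0 for letter in letters}
    return {letter: chunk.count(letter) / len(chunk) for letter in letters}

text = "aabbbbccabab"
chunksize = 4
frequencies = [
    compare_characters(text[start:start + chunksize])
    for start in range(0, len(text), chunksize)
]
print(frequencies)
```

Dividing by len(chunk) keeps the last chunk comparable to the others even when it is shorter than chunksize.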


Tags: python, string, chunks