How can I look inside different chunks of a string in my Python program?

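Beginner programmer here, with a lot still to learn. I'm working with a very large text file, and I want to look at character frequencies in different chunks of the text. For example, how often do the characters "a" and "b" appear in text[0:600] vs. text[600:1200] vs. text[1200:1800], and so on? Right now I only know how to print text[0:600]; I don't know how to write the syntax that tells Python to look for "a" and "b" only in that chunk of text.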

I was thinking the best way to write this might be something like, "for each of these chunks I have, tell me the frequency counts of 'a' and 'b'." Does this seem doable?

Thanks very much!

Here's what I have so far, if you want to see it. It's very simple:

import re

with open('text.txt') as f:
    fa = f.read()

fa = fa.lower()
corn = re.sub(r'chr', '', fa)  # delete chromosome title
potato = re.sub(r'[^atcg]', '', corn)  # delete all other characters

print(potato[0:50])

You can position the file cursor and read from there:

with open('myfile.txt') as myfile:
    myfile.seek(1200)
    text = myfile.read(600)

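This reads from offset 1200 onward; for plain ASCII text, that is the 600 characters starting at character 1200. Note that seek() works on byte offsets, so when the text contains multi-byte Unicode characters the position may no longer line up with the character index.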
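If the byte-versus-character mismatch is a concern, one option is to open the file in binary mode so that both the seek offset and the read size are unambiguously in bytes. The following is only a minimal sketch: `read_chunk` is a hypothetical helper name, and the demo file is a throwaway temp file filled with made-up data, not the asker's real input.

```python
import os
import tempfile

def read_chunk(path, start, size):
    """Return up to `size` bytes of the file at `path`, starting at byte offset `start`."""
    with open(path, 'rb') as handle:
        handle.seek(start)
        return handle.read(size)

# Demo on a throwaway file (assumption: the real data is plain ASCII like this).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'acgt' * 500)  # 2000 bytes in total
    demo_path = tmp.name

chunk = read_chunk(demo_path, 1200, 600)
# bytes.count works like str.count, but needs a bytes argument.
a_count = chunk.count(b'a')
os.remove(demo_path)
```

Because the chunk comes back as `bytes`, call `chunk.decode('ascii')` (or whichever encoding applies) if you need a `str` afterwards.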

You already know how to slice the text. The general case is:

interval = 600
chunks = [text[idx:idx+interval] for idx in range(0, len(text), interval)]

And to count the occurrences of a substring (here, 'a') in each chunk:

term = 'a'
term_counts = [chunk.count(term) for chunk in chunks]
# zip them together to make it nicer (note that zip returns an iterator in Python 3)
chunks_with_counts = zip(chunks, term_counts)

Example:

>>> text = "The quick brown fox jumps over the lazy dog"
>>> interval = 3
>>> chunks = [text[idx:idx+interval] for idx in range(0, len(text), interval)]
>>> chunks
['The', ' qu', 'ick', ' br', 'own', ' fo', 'x j', 'ump', 's o', 'ver', ' th', 'e l', 'azy', ' do', 'g']
>>> term='o'
>>> term_counts = [chunk.count(term) for chunk in chunks]
>>> term_counts
[0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
>>> chunks_with_counts = zip(chunks, term_counts)
>>> list(chunks_with_counts)
[('The', 0), (' qu', 0), ('ick', 0), (' br', 0), ('own', 1), (' fo', 1), ('x j', 0), ('ump', 0), ('s o', 1), ('ver', 0), (' th', 0), ('e l', 0), ('azy', 0), (' do', 1), ('g', 0)]
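The question asks about two characters at once ('a' and 'b'), so the same chunking idea can be extended to tally several characters per chunk. Here is a rough sketch using `collections.Counter`; the function name `chunk_counts` and the sample string are made up for illustration, and the approach assumes each term is a single character.

```python
from collections import Counter

def chunk_counts(text, interval, terms):
    """Return one dict per chunk, mapping each single-character term to its count."""
    results = []
    for start in range(0, len(text), interval):
        # Counter tallies every character in the chunk in a single pass.
        tally = Counter(text[start:start + interval])
        # Counter returns 0 for characters it never saw, so missing terms are safe.
        results.append({term: tally[term] for term in terms})
    return results

sample = "aaabbbabba"  # made-up data
per_chunk = chunk_counts(sample, 4, "ab")
for index, counts in enumerate(per_chunk):
    print(index, counts)
```

For multi-character substrings, `Counter` will not help; fall back to `chunk.count(term)` as shown earlier.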

Yes, that seems doable. You can loop over the text in chunks:

def compare_characters(chunk):
    # check for frequency of a and b or whatever
    pass

chunksize = 600
i = 0
while i*chunksize < len(text):
    compare_characters(text[i*chunksize:(i+1)*chunksize])
    i+=1
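The body of compare_characters is left open above. One possible way to fill it in, offered only as a sketch, is to return the relative frequency of each character of interest within the chunk. The `terms` parameter and the sample text are assumptions added for illustration.

```python
def compare_characters(chunk, terms=('a', 'b')):
    """Return the fraction of `chunk` made up by each character in `terms`."""
    if not chunk:
        # Avoid dividing by zero on an empty chunk.
        return {term: 0.0 for term in terms}
    return {term: chunk.count(term) / len(chunk) for term in terms}

# Made-up sample: 600 chars of 'aaab', 600 chars of 'abbb', then a short tail.
text = 'aaab' * 150 + 'abbb' * 150 + 'ab'
chunksize = 600
frequencies = [
    compare_characters(text[start:start + chunksize])
    for start in range(0, len(text), chunksize)
]
```

Note that the last chunk is usually shorter than `chunksize`, so its frequencies are based on fewer characters. On Python 2, `/` on two integers truncates, so add `from __future__ import division` there.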