Iterate over multiple files and count multiple strings

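I want to write code that opens multiple text files and counts how many times each predefined string occurs in each file. The output I want can be a list with the total number of occurrences of each string across the files.

The strings I am looking for are the values of a dictionary.

For example: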

mi = {"key1": "string1", "key2": "string2"}  # and so on...

I already have code that opens a single file and produces the count I want. See below:

mi = {} #my dictionary
data = open("test.txt", "r").read()
import collections 
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = list()

for value in od_mi.values():
    count = data.count(value)
    count_occur.append(count)

lista_keys = []   
for key in od_mi.keys():
    lista_keys.append(key)

dic_final = dict(zip(lista_keys, count_occur))
od_mi_final = collections.OrderedDict(sorted(dic_final.items()))

print(od_mi_final) #A final dictionary with keys and values with the count of how many times each string occur. 

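My next goal is to do the same for multiple files. I have a set of text files named according to a pattern, e.g. "ABC 01.2015.txt", "ABC 02.2015.txt", ...

I made 3 text files as test files, and in each of them every string occurs once. So in my test run, the output I expect is a count of 3 for each string.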

import collections

mi = {}
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = []
for i in range(2, 5):
    for value in od_mi.values():
        x = "ABC" + " " + str(i) + ".2015.txt"
        data = open(x, "r").read()
        contar = data.count(value)
        count_occur.append(contar)

print(count_occur)

Output:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

I realize that my code overwrites the count on every pass through the loop. How can I fix this?

You should use Counter to simplify your code:

from collections import Counter

mi = {'key1': 'string1', 'key2': 'string2'}
count_occur = []
with open("test.txt", "r") as data_file:
    for data in data_file:
        count_occur.extend([d for d in data.split() if d in mi.values()])

print(Counter(count_occur))

Then, to handle multiple files, just loop over a list of file names, for example:

from collections import Counter

count_occur = []
mi = {'key1': 'string1', 'key2': 'string2'}
files = ["ABC" + " " + str(i) +".2015.txt" for i in range(2,5)]

for file_c in files:
    with open(file_c, "r") as data_file:
        for data in data_file:
            count_occur.extend([d for d in data.split() if d in mi.values()])

print(Counter(count_occur))
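
Note that splitting on whitespace only finds single-word strings. If you would rather keep the substring-based `str.count` from the question, here is a minimal sketch of summing the counts per key across files. The helper name `count_in_files` and the temporary demo files are my own assumptions for illustration, not something from your setup:

```python
import os
import tempfile
from collections import Counter


def count_in_files(paths, targets):
    """Sum, per key, how often each target string occurs across all files.

    `targets` maps a key to the string to search for (like `mi`).
    Uses str.count, so matches are non-overlapping substrings.
    """
    totals = Counter({key: 0 for key in targets})
    for path in paths:
        with open(path, "r") as data_file:
            data = data_file.read()
        for key, value in targets.items():
            totals[key] += data.count(value)
    return totals


# Demo with three throwaway files, each containing every string once.
mi = {"key1": "string1", "key2": "string2"}
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2, 5):
        path = os.path.join(tmp, "ABC " + str(i) + ".2015.txt")
        with open(path, "w") as fh:
            fh.write("some string1 and some string2\n")
        paths.append(path)
    totals = count_in_files(paths, mi)

print(dict(totals))  # {'key1': 3, 'key2': 3}
```

The key point is that the totals live outside the file loop and are incremented, so each file adds to the running count instead of producing a separate entry.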

Create a Counter from the values in your mi dict, then use the intersection between the keys of that Counter and the split words of each line:

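from collections import Counter

mi = {"key1": "string1", "key2": "string2"}

# Start every target string at 0 so strings that never occur still show up.
counts = Counter(dict.fromkeys(mi.values(), 0))
for fle in list_of_file_names:  # list_of_file_names: your list of file paths
    with open(fle) as f:
        for words in map(str.split, f):
            # keys() is the Python 3 spelling of Python 2's viewkeys();
            # the intersection counts each string at most once per line.
            counts.update(counts.keys() & words)
print(counts)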
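
One thing to be aware of with the intersection approach: a set intersection contains each string at most once, so a string repeated on the same line is counted once for that line. A small sketch on in-memory lines (the sample `lines` are made up for illustration) shows this next to a per-occurrence variant:

```python
from collections import Counter

mi = {"key1": "string1", "key2": "string2"}

lines = [
    "string1 string1 string2\n",  # string1 appears twice on this line
    "nothing to see here\n",
    "string2\n",
]

# Intersection: each string counted at most once per line.
counts = Counter(dict.fromkeys(mi.values(), 0))
for words in map(str.split, lines):
    counts.update(counts.keys() & words)

# Per occurrence: every matching word on the line is counted.
per_occurrence = Counter(dict.fromkeys(mi.values(), 0))
for words in map(str.split, lines):
    per_occurrence.update(w for w in words if w in per_occurrence)

print(dict(counts))          # {'string1': 1, 'string2': 2}
print(dict(per_occurrence))  # {'string1': 2, 'string2': 2}
```

Which of the two you want depends on whether you are counting lines that mention a string or total mentions.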

If you are looking for exact matches and also need to find multi-word phrases, your best option is a regex with word boundaries:

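import re
from collections import Counter

# Reuses mi and list_of_file_names from the previous snippet.
counts = Counter(dict.fromkeys(mi.values(), 0))
# re.escape keeps characters such as "." in a value from acting as regex syntax.
patt = re.compile("|".join([r"\b{}\b".format(re.escape(v)) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line))
print(counts)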
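
To see what the word boundaries buy you, here is a small sketch comparing plain substring counting with the `\b` pattern. The sample `text` is invented for illustration, and `re.escape` is my own addition so that the values are treated as literal text:

```python
import re
from collections import Counter

mi = {"key1": "string1", "key2": "string2"}
text = "string1 string10 string2, string2."

# Plain substring counting also matches "string1" inside "string10".
substring_counts = {v: text.count(v) for v in mi.values()}

# \b requires a word boundary on both sides, so "string10" is skipped.
patt = re.compile("|".join(r"\b{}\b".format(re.escape(v)) for v in mi.values()))
regex_counts = Counter(patt.findall(text))

print(substring_counts)    # {'string1': 2, 'string2': 2}
print(dict(regex_counts))  # {'string1': 1, 'string2': 2}
```

Punctuation right after a word still counts as a boundary, which is why both occurrences of "string2" are found.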

You could also call the regex on f.read() for the whole file at once, assuming the file content fits into memory:

with open(fle) as f:
    counts.update(patt.findall(f.read()))

The standard re module does not find overlapping matches. If you pip install regex, setting its overlapped flag will capture overlapping matches as well:

import regex  # third-party module: pip install regex
from collections import Counter

# Reuses mi and list_of_file_names from the snippets above.
counts = Counter(dict.fromkeys(mi.values(), 0))

patt = regex.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line, overlapped=True))
print(counts)

If we change your example slightly, you can see the difference:

In [30]: s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."

In [31]: mi =  {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}
In [32]: patt = re.compile("|".join([r"\b{}\b".format(v) for v in mi.values()])) 
In [33]: patt.findall(s)
Out[33]: ['conflitantes sobre']

In [34]: patt = regex.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))

In [35]: patt.findall(s,overlapped=True)
Out[35]: ['conflitantes sobre', 'sobre']
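
If installing a third-party module is not an option, a different technique (not the overlapped flag, just the standard re module) is to search for each string with its own pattern, so that one string being contained in another no longer hides it. This is only a sketch under that assumption, and it will not catch a phrase that overlaps with itself:

```python
import re
from collections import Counter

s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."
mi = {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}

counts = Counter()
for value in mi.values():
    # One pattern per string: each search scans the text independently.
    patt = re.compile(r"\b{}\b".format(re.escape(value)))
    counts[value] += len(patt.findall(s))

print(dict(counts))  # {'conflitantes sobre': 1, 'sobre': 1}
```

The trade-off is one pass over the text per string instead of a single pass with one combined pattern.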