Count strings in a file, some single words, some full sentences
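I want to count the occurrences of certain words and names in a file. The code below incorrectly counts 'fish and chips' as one occurrence of 'fish' and one occurrence of 'chips', rather than as one occurrence of 'fish and chips'.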
Contents of ngh.txt:
test file with words fish, steak fish chips fish and chips
import re
from collections import Counter
wanted = '''
"fish and chips"
fish
chips
steak
'''
cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
Output:
Counter({'fish': 3, 'chips': 2, 'and': 1, 'steak': 1})
What I want is:
Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})
(Ideally, I would get output like this:
fish: 2
fish and chips: 1
chips: 1
steak: 1
)
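This solution works for your test data (I also added a few terms to the test data, just to be thorough), although it could probably be improved.
The key is to find the occurrence of 'and' in the word list, replace 'and' and its neighbours with a compound word (the neighbours joined by 'and'), and add that compound word back to the list together with a copy of 'and'.
I also converted the 'wanted' string into a list, so that 'fish and chips' is treated as a distinct item.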
import re
from collections import Counter
# changed 'wanted' string to a list
wanted = ['fish and chips', 'fish', 'chips', 'steak', 'and']
cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())
# Note: only verified for text with a single 'and' that has a word on each side
for word in words:
    # look for 'and', replace it and neighbours with 'comp_word'
    # slice, concatenate, and append to make new words list
    if word == 'and':
        and_pos = words.index('and')
        comp_word = str(words[and_pos-1]) + ' and ' + str(words[and_pos+1])
        words = words[:and_pos-1] + words[and_pos+2:]
        words.append(comp_word)
        words.append('and')
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
The output for your text would be:
Counter({'fish':2, 'and':1, 'steak':1, 'chips':1, 'fish and chips':1})
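As noted in the comments above, it is not clear why your ideal output wants/expects 2 for fish, 2 for chips, and 1 for fish and chips. I assume that is a typo, since the output above has 'chips': 1.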
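I suggest two algorithms that work for any patterns and any file.
The running time of the first algorithm is proportional to (number of characters in the file) * (number of patterns).
1> For each pattern, search through all the patterns and build a list of its superpatterns (the longer patterns that contain it). This can be done by matching one pattern (e.g. 'cat') against all the patterns to be searched.
patterns = ['cat', 'cat and dogs', 'cat and fish']
superpattern = {'cat': ['cat and dogs', 'cat and fish']}
2> Search the file for 'cat'; call the result cat_count.
3> Now search the file for each superpattern of 'cat', get its count, and subtract it from cat_count:
for sp in superpattern['cat']:
    sp_count = file_text.count(sp)  # file_text: the file's contents as a string
    cat_count = cat_count - sp_count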
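The three steps above can be sketched as runnable Python. This is a minimal sketch, not the answerer's code: the helper names raw_count, build_superpatterns and count_patterns and the sample text are made up for illustration. It assumes that no superpattern is itself contained in another superpattern (otherwise the same occurrence would be subtracted twice) and that each superpattern contains the shorter pattern exactly once.

```python
import re

def raw_count(pattern, text):
    """Count whole-word, non-overlapping occurrences of pattern in text."""
    return len(re.findall(r'\b' + re.escape(pattern) + r'\b', text))

def build_superpatterns(patterns):
    """Step 1: map each pattern to the other patterns that contain it."""
    return {
        p: [q for q in patterns if q != p and raw_count(p, q) > 0]
        for p in patterns
    }

def count_patterns(patterns, text):
    superpattern = build_superpatterns(patterns)
    # Step 2: raw count of every pattern in the text
    raw = {p: raw_count(p, text) for p in patterns}
    # Step 3: subtract the counts of each pattern's superpatterns
    return {p: raw[p] - sum(raw[sp] for sp in superpattern[p])
            for p in patterns}

patterns = ['cat', 'cat and dogs', 'cat and fish']
text = 'cat, cat and dogs, cat and fish, cat'  # hypothetical sample text
print(count_patterns(patterns, text))
```

Every pattern is searched for separately, which is where the (characters in the file) * (number of patterns) running time comes from.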
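That is the brute-force general solution. If we arrange the patterns in a trie, we should be able to come up with a linear-time solution.
root-->f-->i-->s-->h-->a and so on.
Now, when you are at the h of fish and the next character is not an a, increment fish_count and go back to the root. If you do get an 'a', continue. Any time you get something unexpected, increment the count of the most recently found pattern and go either to the root or to some other node (the node for the longest suffix of the text matched so far that is also a prefix of some pattern). This is the Aho-Corasick algorithm; you can look it up on Wikipedia or at:
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
This solution is linear in the number of characters in the file.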
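The trie idea can be illustrated with a simplified sketch. Note that this is not the full Aho-Corasick algorithm: it builds the trie over whole words rather than characters, has no failure links, and simply takes the longest wanted phrase starting at each word, so its running time is bounded by (number of words) * (length of the longest phrase) rather than being strictly linear. The names END, build_trie and count_longest_matches are hypothetical, and the phrases are assumed to be lowercase words separated by spaces.

```python
import re
from collections import Counter

END = object()  # marker key: a wanted phrase ends at this trie node

def build_trie(phrases):
    """Build a word-level trie; each node is a dict keyed by the next word."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root

def count_longest_matches(text, phrases):
    words = re.findall(r'\w+', text.lower())
    root = build_trie(phrases)
    counts = Counter()
    i = 0
    while i < len(words):
        node, j = root, i
        best, best_end = None, i + 1  # no match: advance by one word
        # Walk down the trie while the text keeps matching
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:  # remember the longest phrase seen so far
                best, best_end = node[END], j
        if best is not None:
            counts[best] += 1
        i = best_end  # resume after the matched phrase
    return counts

text = 'test file with words fish, steak fish chips fish and chips'
wanted = ['fish and chips', 'fish', 'chips', 'steak']
counts = count_longest_matches(text, wanted)
for phrase in wanted:
    print('%s: %d' % (phrase, counts[phrase]))
```

Because a matched phrase is consumed as a whole, the 'fish' and 'chips' inside 'fish and chips' are never counted on their own.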
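Definition:
Wanted item: a string that is being searched for in the text.
To count wanted items without re-counting them inside longer wanted items, first count the number of times each item occurs in the string. Next, go through the wanted items from longest to shortest, and whenever a shorter item occurs inside a longer one, subtract the longer item's count from the shorter item's count. For example, suppose your wanted items are "a", "a b" and "a b c", and your text is "a/a/a b/a b c". Searching for each one individually gives: { "a": 4, "a b": 2, "a b c": 1 }. The desired result is: { "a b c": 1, "a b": #("a b") - #("a b c") = 2 - 1 = 1, "a": #("a") - #("a b c") - #("a b") = 4 - 1 - 1 = 2 }, where the #("a b") subtracted from "a" is the already adjusted count of 1.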
import re
def get_word_counts(text, wanted):
    counts = {}  # The number of times a wanted item was read
    # Dictionary mapping word lengths onto wanted items
    # (in the form of a dictionary where keys are wanted items)
    lengths = {}
    # Find the number of times each wanted item occurs
    for item in wanted:
        # Raw strings keep \b as a word boundary (not a backspace character)
        matches = re.findall(r'\b' + re.escape(item) + r'\b', text)
        counts[item] = len(matches)
        l = len(item)  # Length of wanted item
        # No wanted item of the same length has been encountered
        if l not in lengths:
            # Create new dictionary of items of the given length
            lengths[l] = {}
        # Add wanted item to dictionary of items with the given length
        lengths[l][item] = 1
    # Get and sort lengths of wanted items from largest to smallest
    keys = sorted(lengths, reverse=True)
    # Remove overlapping wanted items from the counts working from
    # largest strings to smallest strings
    for i in range(1, len(keys)):
        for j in range(0, i):
            for i_item in lengths[keys[i]]:
                for j_item in lengths[keys[j]]:
                    # How often the shorter item occurs inside the longer one
                    matches = re.findall(r'\b' + re.escape(i_item) + r'\b', j_item)
                    counts[i_item] -= len(matches) * counts[j_item]
    return counts
The following code contains test cases:
tests = [
    {
        'text': 'test file with words fish, steak fish chips fish and '+
                'chips and fries',
        'wanted': ["fish and chips","fish","chips","steak"]
    },
    {
        'text': 'fish, fish and chips, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'My fish and chips and burgers. My fish and chips and '+
                'burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish","fish fish fish"]
    }
]
for test in tests:
    print(test['text'])
    print(get_word_counts(test['text'], test['wanted']))
    print('')
The output is as follows:
test file with words fish, steak fish chips fish and chips and fries
{'fish and chips': 1, 'steak': 1, 'chips': 1, 'fish': 2}
fish, fish and chips, fish and chips and burgers
{'fish and chips': 1, 'fish and chips and burgers': 1, 'fish': 1}
fish, fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 1, 'fish': 1}
My fish and chips and burgers. My fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 2, 'fish': 0}
fish fish fish
{'fish fish': 1, 'fish': 1}
fish fish fish
{'fish fish fish': 1, 'fish fish': 0, 'fish': 0}