Check text/string for occurrence of predefined list elements
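I have several text files that I want to compare against a vocabulary list containing both multi-word expressions and single words. The desired output is a dictionary with every element of that list as a key and its frequency in the text file as the value. To build the vocabulary list, I need to merge two lists: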
    list1 = ['accounting',..., 'yields', 'zero-bond']
    list2 = ['accounting', 'actual cost', ..., 'zero-bond']
    vocabulary_list = ['accounting', 'actual cost', ..., 'yields', 'zero-bond']
    sample_text = "Accounting experts predict an increase in yields for zero-bond and yields for junk-bonds."
    desired_output = {'accounting':1, 'actual cost':0, ..., 'yields':2, 'zero-bond':1}
What I tried:
    from collections import Counter

    def word_frequency(fileobj, words):
        """Build a Counter of specified words in fileobj"""
        # initialise the counter to 0 for each word
        ct = Counter(dict((w, 0) for w in words))
        # split each line into words; iterating a line directly yields characters
        file_words = (word for line in fileobj for word in line.split())
        filtered_words = (word for word in file_words if word in words)
        # update the zero-initialised counter so absent words keep a count of 0
        ct.update(filtered_words)
        return ct

    def print_summary(filepath, ct):
        words = sorted(ct.keys())
        counts = [str(ct[k]) for k in words]
        with open(filepath[:-4] + '_dict' + '.txt', mode='w') as outfile:
            outfile.write('{0}\n{1}\n{2}\n\n'.format(filepath, ', '.join(words), ', '.join(counts)))
        return outfile
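Is there a way to do this in Python? I figured out how to handle it for a vocabulary list of single words (1 token), but I could not find a solution for the multi-word case.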
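If you want to account for words that end with punctuation, you will also need to clean the text, e.g. so that 'yields' and 'yields!' are counted as the same word.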
    from collections import Counter
    import re

    vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
    d = {k: 0 for k in vocabulary_list}
    sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
    # keep a list, not a set, so repeated words are counted more than once
    splitted = sample_text.split()
    c = Counter(splitted)  # get count of all words
    for k in d:
        # a multi-word phrase never appears among the split tokens,
        # so search the full text for it instead
        if len(k.split()) > 1:
            check = re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text)
            d[k] += len(check)
        # else we are looking for a single word
        elif k in c:
            d[k] += c[k]
    print(d)
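To illustrate the point about punctuation, here is a minimal sketch of one way to clean tokens before counting. The helper name `count_clean_tokens` is made up for this example, and it assumes that stripping leading and trailing ASCII punctuation from each whitespace-separated token is enough for your texts. Hyphens inside a word such as 'zero-bond' are left alone.

```python
import string
from collections import Counter

def count_clean_tokens(text):
    """Count lowercase tokens after stripping punctuation from both ends.

    Hypothetical helper, not part of the answer's code above.
    """
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    # drop tokens that consisted only of punctuation
    return Counter(tok for tok in tokens if tok)

counts = count_clean_tokens("Yields rose; yields! fell for junk-bonds.")
print(counts["yields"])      # 2
print(counts["junk-bonds"])  # 1
```

A Counter built this way could be used in place of `c` for the single-word lookups.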
To chain all the lists into one vocabulary dict:
    from collections import Counter
    from itertools import chain
    import re

    l1, l2 = ['accounting', 'actual cost'], ['yields', 'zero-bond']
    vocabulary_dict = {k: 0 for k in chain(l1, l2)}
    print(vocabulary_dict)
    sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
    splitted = sample_text.split()
    c = Counter(splitted)  # count of every whitespace-separated token
    for k in vocabulary_dict:
        # multi-word phrase: count matches in the full text
        if len(k.split()) > 1:
            check = re.findall(r'\b{0}\b'.format(re.escape(k)), sample_text)
            vocabulary_dict[k] += len(check)
        # single word: look it up in the token counts
        elif k in c:
            vocabulary_dict[k] += c[k]
    print(vocabulary_dict)
You could create two dicts, one for phrases and the other for single words, and then do a pass over each.
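The two-dict idea could be sketched roughly as follows. The function name `vocabulary_counts` and the token pattern `[\w-]+` are assumptions made for this example. The sketch also assumes the vocabulary entries are already lowercase and that a phrase is any entry containing a space.

```python
import re
from collections import Counter

def vocabulary_counts(text, vocabulary):
    """Count single words and multi-word phrases in two separate passes.

    Hypothetical helper; assumes lowercase vocabulary entries.
    """
    text = text.lower()
    phrases = {v: 0 for v in vocabulary if " " in v}
    words = {v: 0 for v in vocabulary if " " not in v}

    # Pass 1: single words, counted from tokens made of word characters
    # and hyphens, so trailing punctuation is not part of the token.
    token_counts = Counter(re.findall(r"[\w-]+", text))
    for w in words:
        words[w] = token_counts[w]

    # Pass 2: phrases, searched in the full text with word boundaries.
    for p in phrases:
        phrases[p] = len(re.findall(r"\b{0}\b".format(re.escape(p)), text))

    # Merge the two dicts, keeping the original vocabulary order.
    merged = {**words, **phrases}
    return {v: merged[v] for v in vocabulary}

sample_text = ("Accounting experts predict an increase in yields "
               "for zero-bond and yields for junk-bonds.")
vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
print(vocabulary_counts(sample_text, vocabulary_list))
# {'accounting': 1, 'actual cost': 0, 'yields': 2, 'zero-bond': 1}
```

To run it on files, read each file's contents into a string and pass that as `text`.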