Python - count word frequency of string from list, number of words from list varies
I'm trying to create a program that runs through a list of mental health terms, looks at research abstracts, and counts how many times each word or phrase appears. I can get this working for single words, but I'm struggling to do it for multi-word terms. I also tried using NLTK ngrams, but because the number of words in the mental health terms varies (i.e., not every term in the list is a bigram or trigram), I couldn't make that work either.
I want to stress that I know splitting on whitespace only counts single words; I'm just stuck on how to handle the varying number of words per term when counting them in the abstracts.
Thanks!
from collections import Counter

abstracts = ['This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.']

for x2 in abstracts:
    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
                'ptsd', 'schizophrenia', 'mental health']
    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',', '')
        term = term.replace('.', '')
        xx = (term, c.get(term, 0))
    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)
In my example, the count for both abstracts is 1, but I want the count to be 2.
The problem is that you will never match "mental health", because you only count occurrences of single words separated by the ' ' character.
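As a minimal illustration (reusing part of the second example abstract), the Counter built from split() only ever has single-word keys, so a two-word term can never be found in it:

from collections import Counter

text = 'it does have a mental health focus.'
c = Counter(s.lower().replace('.', '') for s in text.split())

print(c['mental'])          # 1
print(c['health'])          # 1
print(c['mental health'])   # 0 -- a two-word key never exists in the Counter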
I'm not sure a Counter is the right solution here. If you really need a highly scalable, indexable solution, n-grams are probably the way to go, but for small-to-medium problems, regex pattern matching should be fast enough.
import re

abstracts = [
    'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
    'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.'
]

mh_terms = [
    'bipolar disorder', 'anxiety', 'substance abuse disorder',
    'ptsd', 'schizophrenia', 'mental health'
]

def _regex_word(text):
    """ wrap text with the regex start/end-of-word anchors """
    # raw string so \b is a word boundary, not a backspace character
    return r'\b{}\b'.format(text)

def _normalize(text):
    """ lowercase and remove any non alpha/numeric/space character """
    return re.sub('[^a-z0-9 ]', '', text.lower())

normed_terms = [_normalize(term) for term in mh_terms]

for raw_abstract in abstracts:
    print('--------')
    normed_abstract = _normalize(raw_abstract)

    # Search for all occurrences of the chosen terms
    found = {}
    for norm_term in normed_terms:
        pattern = _regex_word(norm_term)
        found[norm_term] = len(re.findall(pattern, normed_abstract))
    print('found = {!r}'.format(found))

    mh_total_occur = sum(found.values())
    print('mh_total_occur = {!r}'.format(mh_total_occur))
I've tried to add helper functions and comments to make it clear what I'm doing.
Using the \b regex metacharacter matters in the general case, because it prevents a possible search term like "miss" from matching words like "dismiss".
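For illustration, here is a minimal sketch of that behaviour (the sample text and the term "miss" are made-up examples, not from the question):

import re

text = 'do not dismiss a miss'

# Without word boundaries the pattern also matches inside "dismiss"
print(len(re.findall('miss', text)))        # 2
# With \b only the standalone word "miss" is counted
print(len(re.findall(r'\bmiss\b', text)))   # 1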