How to get sum of word frequencies by sentence in a document?
I have a small article (a document), and I have computed the frequency of every token in it.
Now I want to break the document into sentences and get a score for each sentence, where 'score' is defined as the sum of the word frequencies of the words in that sentence.
For example, the article might be:
article = 'We encourage you to take time to read and understand the below information. The first section will help make sure that your investment objectives are still aligned with your current strategy.'
I get the word frequencies like this:
import nltk
from nltk.probability import FreqDist

words = nltk.tokenize.word_tokenize(article)
fdist = FreqDist(words)
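For reference, fdist can be indexed like a dictionary of token counts, e.g. for this article:

fdist['to']    # 2 ('to' occurs twice in the article)
fdist['your']  # 2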
The solution must be something simple, like looking up the article's tokens to get the score, but I can't seem to figure it out.
Ideally the output would look something like sentScore = [7, 5], so that I can easily pick out the top n sentences. In this case sentScore is just the sum of the word frequencies for each sentence (two sentences here).
Edit: I need to add these counts together at the sentence level. To split the sentences I am currently using

sentences = tokenize.sent_tokenize(article)  # tokenize is nltk.tokenize

which is smart enough to handle period punctuation correctly. Essentially, the frequencies should be computed at the article level, and the scoring then done at the sentence level by summing the individual word frequencies.
Thanks!
Once you have the counts of all the words, you need to tokenize the article into sentences and then tokenize each sentence into words. Each sentence can then be reduced to the sum of its word counts:
import nltk
from collections import Counter

words = nltk.tokenize.word_tokenize(article)
# Counter creates a dictionary of word counts, similar to the
# FreqDist in your code.
word_count = Counter(words)

sentences = nltk.tokenize.sent_tokenize(article)
# sentence_words is a list of lists: the article is tokenized into
# sentences, and each sentence into words.
sentence_words = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]

sentence_scores = [sum(word_count[word] for word in sentence) for sentence in sentence_words]
For your example article, sentence_scores is [17, 22].
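From there, picking out the top n sentences is straightforward. A minimal sketch, reusing sentence_scores and sentences from the snippet above (heapq.nlargest is one option; n = 1 is just an illustrative choice):

import heapq

# Pair each score with its sentence and keep the n highest-scoring pairs.
n = 1
top_sentences = heapq.nlargest(n, zip(sentence_scores, sentences))
for score, sentence in top_sentences:
    print(score, sentence)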