Print the 10 most frequently occurring words of a text, including and excluding stopwords
I adapted the question from here with my own changes. I have the following code:
    import nltk

    def content_text(text):
        stopwords = nltk.corpus.stopwords.words('english')
        content = [w for w in text if w.lower() in stopwords]
        return content
How can I print the 10 most frequently occurring words of a text 1) including and 2) excluding stopwords?
I'm not sure about the is stopwords in your function; I imagine it needs to be in. Either way, you can use a Counter dict with most_common(10) to get the 10 most common words:
    import nltk
    from collections import Counter
    from string import punctuation

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives O(1) lookups
        with_stp = Counter()
        without_stp = Counter()
        with open(text) as f:
            for line in f:
                spl = line.split()
                # update the count of all words in the line that are stopwords
                with_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() in stopwords)
                # update the count of all words in the line that are not stopwords
                without_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() not in stopwords)
        # return the ten most common (word, count) pairs from each counter
        return with_stp.most_common(10), without_stp.most_common(10)

    wth_stop, wthout_stop = content_text(...)
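The function above splits the counts into stopwords and non-stopwords. If "including stopwords" is instead read as "over all words", the same Counter idea can be sketched as follows. This is only a minimal sketch: SAMPLE_STOPWORDS is a hypothetical stand-in for nltk.corpus.stopwords.words('english') so that the code runs without NLTK data, and top_words and sample are names invented for the illustration.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for nltk.corpus.stopwords.words('english'),
# so the sketch runs without downloading NLTK data.
SAMPLE_STOPWORDS = {"the", "a", "of", "and", "on", "is"}

def top_words(text, stopwords, n=10):
    """Return (top n over all words, top n over non-stopwords)."""
    # Lowercase and strip surrounding punctuation before any comparison.
    words = [w.lower().strip(punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    including = Counter(words)
    excluding = Counter(w for w in words if w not in stopwords)
    return including.most_common(n), excluding.most_common(n)

sample = "The cat sat on the mat. The cat is a cat, and the mat is here!"
with_stop, without_stop = top_words(sample, SAMPLE_STOPWORDS, n=2)
print(with_stop)     # [('the', 4), ('cat', 3)]
print(without_stop)  # [('cat', 3), ('mat', 2)]
```

Normalizing each word once, before the stopword test, avoids the case mismatch where "The" is compared against a lowercase stopword list.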
If you are passing in an nltk file object, just iterate over it:
    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))
        with_stp = Counter()
        without_stp = Counter()
        for word in text:
            word = word.lower()
            if word in stopwords:
                # count words that are stopwords
                with_stp.update([word])
            else:
                # count words that are not stopwords
                without_stp.update([word])
        # return the ten most common words from each counter
        return [k for k, _ in with_stp.most_common(10)], [y for y, _ in without_stp.most_common(10)]

    print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))
The nltk approach includes punctuation tokens, so it may not be what you want.
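One possible way around the punctuation issue is to skip tokens that contain no letters or digits before counting. The sketch below assumes the input is a flat sequence of string tokens, like the one a corpus reader's words() call yields; count_word_tokens and the sample token list are hypothetical names used only for illustration.

```python
from collections import Counter

def count_word_tokens(tokens):
    """Count tokens, skipping those with no letters or digits (e.g. ',' or '--')."""
    return Counter(
        tok.lower() for tok in tokens
        if any(ch.isalnum() for ch in tok)
    )

# Hypothetical token list, shaped like the output of a corpus reader's words().
tokens = ["My", "fellow", "citizens", ":", "I", "stand", "here", ",", "my", "friends", "."]
counts = count_word_tokens(tokens)
print(counts.most_common(1))  # [('my', 2)]
```

Testing for any alphanumeric character, rather than requiring the whole token to be alphabetic, keeps tokens such as "don't" or "21st" while still dropping pure punctuation.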
nltk also provides FreqDist for this:
    import nltk

    # text is assumed to be a string holding the document
    allWords = nltk.tokenize.word_tokenize(text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)

    stopwords = set(nltk.corpus.stopwords.words('english'))
    allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)
To extract the 10 most common:
    # most_common returns a list of (word, count) pairs, so unpack the words
    mostCommon = [word for word, _ in allWordDist.most_common(10)]
You can try this:
    for word, frequency in allWordDist.most_common(10):
        print('%s;%d' % (word, frequency))
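FreqDist is a subclass of collections.Counter, so the shape of what most_common returns can be checked with a plain Counter and no NLTK install. The following is a small sketch on a made-up token list; the variable names are illustrative only.

```python
from collections import Counter

# Made-up tokens; a FreqDist built from the same list would count them the same way.
tokens = ["the", "cat", "the", "dog", "the", "cat"]
dist = Counter(tokens)

pairs = dist.most_common(2)            # list of (word, count) tuples, highest count first
words = [word for word, _ in pairs]    # just the words
lines = ['%s;%d' % (word, count) for word, count in pairs]

for line in lines:
    print(line)
# the;3
# cat;2
```

Because the result is a list of tuples rather than a dict, it has no keys() method; unpacking each pair is the usual way to get at the words or the counts.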