NLTK frequency: combining singular/plural and verb/adverb forms when tokenizing
I want to count frequencies, but I want to combine the singular and plural forms of nouns, and verbs with their adverb forms. Please excuse the bad sentence. For example: "That aggressive person walk by the house over there, one of many houses aggressively."
Tokenize and count the frequencies:
import nltk
from nltk.tokenize import RegexpTokenizer
test = "That aggressive person walk by the house over there, one of many houses aggressively"
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
fdist = nltk.FreqDist(tokens)
common = fdist.most_common(100)
Output:
[('houses', 1), ('aggressively', 1), ('by', 1), ('That', 1), ('house', 1), ('over', 1), ('there', 1), ('walk', 1), ('person', 1), ('many', 1), ('of', 1), ('aggressive', 1), ('one', 1), ('the', 1)]
I would like house and houses to be counted together as ('house\houses', 2), and aggressive and aggressively as ('aggressive\aggressively', 2). Is this possible? If not, how can I go about making it work that way?
You need lemmatization.
NLTK includes a WordNet-based lemmatizer:
import nltk
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
lemmatizer = nltk.stem.WordNetLemmatizer()
test = "That aggressive person walk by the house over there, one of many houses aggressively"
tokens = tokenizer.tokenize(test)
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
fdist = nltk.FreqDist(lemmas)
common = fdist.most_common(100)
This results in:
[('house', 2),
('aggressively', 1),
('by', 1),
('That', 1),
('over', 1),
('there', 1),
('walk', 1),
('person', 1),
('many', 1),
('of', 1),
('aggressive', 1),
('one', 1),
('the', 1)]
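One caveat worth knowing: lemmatize() treats every token as a noun unless you pass a part of speech, so inflected verbs such as walked or walks would be left alone. Here is a minimal sketch of feeding POS tags through, assuming the averaged_perceptron_tagger data is installed (wordnet_pos is a hypothetical helper, not an NLTK function):
import nltk
from nltk.corpus import wordnet

def wordnet_pos(treebank_tag):
    # Translate Penn Treebank tags into the constants WordNetLemmatizer expects.
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # the lemmatizer's default

tagged = nltk.pos_tag(tokens)  # tokens and lemmatizer as defined above
lemmas = [lemmatizer.lemmatize(token, wordnet_pos(tag)) for token, tag in tagged]
For this particular sentence the output is unchanged, since walk is already in base form.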
However, aggressive and aggressively are not merged by the WordNet lemmatizer.
There are other lemmatizers that may do what you want; one possibility is sketched below.
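For the adverb case specifically, you can follow WordNet's pertainym links, which point from an adverb to the adjective it derives from. A minimal sketch, assuming the WordNet data is installed (adverb_to_adjective is a hypothetical helper, not an NLTK function):
from nltk.corpus import wordnet as wn

def adverb_to_adjective(word):
    # Follow the pertainym link from an adverb lemma to its base adjective,
    # e.g. 'aggressively' -> 'aggressive'; fall back to the word itself.
    for lemma in wn.lemmas(word, pos=wn.ADV):
        pertainyms = lemma.pertainyms()
        if pertainyms:
            return pertainyms[0].name()
    return word
Running this over the tokens before lemmatizing would collapse aggressively onto aggressive, where WordNet records the link.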
First, though, you might want to consider stemming:
stemmer = nltk.stem.PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
nltk.FreqDist(stems).most_common()
This gives you:
[(u'aggress', 2),
(u'hous', 2),
(u'there', 1),
(u'That', 1),
(u'of', 1),
(u'over', 1),
(u'walk', 1),
(u'person', 1),
(u'mani', 1),
(u'the', 1),
(u'one', 1),
(u'by', 1)]
The counts look right now!
However, you may be annoyed by the fact that the stems don't necessarily look like real words...
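If that bothers you, one cosmetic workaround (a sketch, not something the stemmer provides) is to keep counting by stem but label each group with its most frequent surface form:
from collections import Counter, defaultdict

# Group the original tokens by stem (tokens and stemmer as defined above).
by_stem = defaultdict(Counter)
for token in tokens:
    by_stem[stemmer.stem(token)][token] += 1

# Label each group with its most common original spelling; ties resolve
# arbitrarily, so pick your own rule if the display form matters.
readable = {forms.most_common(1)[0][0]: sum(forms.values())
            for forms in by_stem.values()}
The counts are identical to the stemmed FreqDist; only the labels change, so 'hous' shows up as house or houses instead.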