Porter and Lancaster stemming clarification
I am stemming with Porter and Lancaster, and I made these observations:
Input: replied
Porter: repli
Lancaster: reply
Input: twice
Porter: twice
Lancaster: twic
Input: came
Porter: came
Lancaster: cam
Input: In
Porter: In
Lancaster: in
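My questions are:
- Lancaster is supposed to be an "aggressive" stemmer, but it works properly with replied. Why?
- The word In stays unchanged in Porter, still uppercase In. Why?
- Notice that Lancaster removes the e at the end of words. Why?
I'm not able to grasp these concepts. Could you please help?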
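Q: Lancaster is supposed to be an "aggressive" stemmer, but it works properly with replied. Why?
That's because the Lancaster stemmer implementation was improved in https://github.com/nltk/nltk/pull/1654
If we take a look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there's a suffix rule that changes -ied > -y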
default_rule_tuple = (
    "ai*2.",    # -ia > - if intact
    "a*1.",     # -a > - if intact
    "bb1.",     # -bb > -b
    "city3s.",  # -ytic > -ys
    "ci2>",     # -ic > -
    "cn1t>",    # -nc > -nt
    "dd1.",     # -dd > -d
    "dei3y>",   # -ied > -y
    ...)
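To make the rule strings easier to read, here is a minimal sketch of how one of them could be decoded. It is not NLTK's code: `decode_rule` and `apply_rule_once` are hypothetical helpers, and the sketch deliberately ignores the "if intact" flag and the checks the real stemmer runs before accepting a stem.

```python
import re

# Shape of a rule string: reversed ending, optional "*" (intact only),
# one digit (characters to remove), optional letters to append,
# then ">" (keep stemming) or "." (stop).
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.]?)$")


def decode_rule(rule):
    """Split a Lancaster rule string into named parts (hypothetical helper)."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError("The rule {0} is invalid".format(rule))
    reversed_ending, intact_flag, remove_count, append_string, flag = match.groups()
    return {
        "ending": reversed_ending[::-1],  # rules store the ending backwards
        "intact_only": intact_flag == "*",
        "remove": int(remove_count),
        "append": append_string,
        "continue": flag == ">",
    }


def apply_rule_once(word, rule):
    """Apply a single rule, ignoring the intact flag and acceptability checks."""
    parts = decode_rule(rule)
    if not word.endswith(parts["ending"]):
        return word
    return word[: len(word) - parts["remove"]] + parts["append"]


print(apply_rule_once("replied", "dei3y>"))  # reply
print(apply_rule_once("came", "e1>"))        # cam
```

So "dei3y>" reads as: the word ends in "ied", drop 3 characters, append "y", then keep going.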
The feature allows users to input new rules; if no additional rules are added, the stemmer uses self.default_rule_tuple in parseRules, which is where the rule_tuple gets applied:
https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196
def parseRules(self, rule_tuple=None):
    """Validate the set of rules used in this stemmer.
    If this function is called as an individual method, without using stem
    method, rule_tuple argument will be compiled into self.rule_dictionary.
    If this function is called within stem, self._rule_tuple will be used.
    """
    # If there is no argument for the function, use class' own rule tuple.
    rule_tuple = rule_tuple if rule_tuple else self._rule_tuple
    valid_rule = re.compile("^[a-z]+\*?\d[a-z]*[>\.]?$")
    # Empty any old rules from the rule set before adding new ones
    self.rule_dictionary = {}
    for rule in rule_tuple:
        if not valid_rule.match(rule):
            raise ValueError("The rule {0} is invalid".format(rule))
        first_letter = rule[0:1]
        if first_letter in self.rule_dictionary:
            self.rule_dictionary[first_letter].append(rule)
        else:
            self.rule_dictionary[first_letter] = [rule]
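Why key the dictionary on the first letter of each rule? The endings are stored backwards, so that first letter is the last letter of the suffix. A rough sketch of the lookup, assuming (my reading, not something the snippet above shows) that the stemmer picks its candidate rules by the word's last letter; `group_rules` and `candidate_rules` are hypothetical helpers:

```python
from collections import defaultdict

# A handful of rules from the default tuple, enough for the examples here.
SAMPLE_RULES = ("ai*2.", "a*1.", "bb1.", "dd1.", "dei3y>", "e1>")


def group_rules(rule_tuple):
    """Group rules by their first letter, like the loop in parseRules."""
    grouped = defaultdict(list)
    for rule in rule_tuple:
        grouped[rule[0]].append(rule)
    return dict(grouped)


def candidate_rules(word, rule_dictionary):
    """Return the rules filed under the word's last letter, in tuple order."""
    return rule_dictionary.get(word[-1], [])


rules_by_letter = group_rules(SAMPLE_RULES)
print(candidate_rules("replied", rules_by_letter))  # ['dd1.', 'dei3y>']
print(candidate_rules("came", rules_by_letter))     # ['e1>']
```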
The default_rule_tuple actually comes from whoosh's implementation of the Paice-Husk stemmer, which is also known as the Lancaster stemmer: https://github.com/nltk/nltk/pull/1661 =)
Q: The word In stays uppercase In in Porter. Why?
This is super interesting! And most probably a bug.
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'
If we look at the code, the first thing PorterStemmer.stem() does is lowercase the word: https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651
def stem(self, word):
    stem = word.lower()
    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]
    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word
    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)
    return stem
But looking at the code, everything else returns stem, which is lowercased, while there are two if clauses that return some form of the original word that is not lowercased!!!
if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[word]
if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return word
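The first if clause checks whether the word is in self.pool, which contains the irregular words and their stems.
The second checks whether len(word) <= 2 and, if so, returns the word in its original form. In the case of "In", the second if condition evaluates to True, so the original, non-lowercased form is returned.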
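If you need lowercase output regardless, one workaround is to lowercase the word yourself before calling stem(). The sketch below uses a stand-in function that only mimics the early return quoted above (it is not the NLTK implementation), and `stem_lowercased` is a hypothetical wrapper:

```python
def short_word_branch(word):
    """Stand-in for the quoted control flow; not the NLTK implementation."""
    stem = word.lower()
    if len(word) <= 2:
        return word  # early return hands back the caller's casing
    return stem      # the real method would run steps 1a to 5b here


def stem_lowercased(stem_function, word):
    """Lowercase first, so any early return already sees a lowercase word."""
    return stem_function(word.lower())


print(short_word_branch("In"))                   # In
print(stem_lowercased(short_word_branch, "In"))  # in
```

With NLTK you would pass porter.stem as stem_function. The behaviour of short words may differ between NLTK versions, so it's worth checking what your installed version returns.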
Q: Notice that Lancaster removes the e at the end of "came". Why?
Not surprisingly, this also comes from the default_rule_tuple, https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67: there's a rule that changes -e > -
=)
Q: How do I disable the -e > - rule from default_rule_tuple?
(Un-)fortunately, the LancasterStemmer._rule_tuple object is an immutable tuple, so we can't simply remove one item from it, but we can override it =)
>>> from nltk.stem import LancasterStemmer
>>> lancaster = LancasterStemmer()
>>> lancaster.stem('came')
'cam'
# Create a new stemmer object to refresh the cache.
>>> lancaster = LancasterStemmer()
# Find the 'e1>' rule.
>>> lancaster._rule_tuple.index('e1>')
12
# Create a temporary rule list from the tuple.
>>> temp_rule_list = list(lancaster._rule_tuple)
# Remove the rule.
>>> temp_rule_list.pop(12)
'e1>'
# Override the `._rule_tuple` variable.
>>> lancaster._rule_tuple = tuple(temp_rule_list)
# Et voila!
>>> lancaster.stem('came')
'came'
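The hard-coded index 12 depends on the order of the tuple in your NLTK version. A slightly more defensive sketch removes the rule by value instead; `without_rule` is a hypothetical helper and RULES_SUBSET is just a few rules from the default tuple, not the whole thing:

```python
def without_rule(rule_tuple, unwanted_rule):
    """Return a new tuple that omits every copy of unwanted_rule."""
    if unwanted_rule not in rule_tuple:
        raise ValueError("The rule {0} is not in the tuple".format(unwanted_rule))
    return tuple(rule for rule in rule_tuple if rule != unwanted_rule)


# Illustrative subset only; in practice you would start from
# lancaster._rule_tuple on a freshly created stemmer.
RULES_SUBSET = ("ai*2.", "dd1.", "dei3y>", "e1>")

print(without_rule(RULES_SUBSET, "e1>"))  # ('ai*2.', 'dd1.', 'dei3y>')
```

The result can be assigned to lancaster._rule_tuple exactly like tuple(temp_rule_list) in the session above. I believe the LancasterStemmer constructor also accepts a rule_tuple argument, which would avoid touching the private attribute, but check that against your NLTK version.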