What should be the outcome of stemming a word with an apostrophe?
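I am using nltk.stem.porter.PorterStemmer in Python to get word stems.
When I stem "women" and "women's", I get different results: "women" and "women'" respectively. For my purposes, I need both words to have the same stem.
As I see it, the two words refer to the same idea/concept and are almost the same word, just inflected, so they should share a stem.
Why do I get two different results? Is this correct?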
You need to tokenize your text before lemmatizing it.
Without tokenization:
>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> [wnl.lemmatize(i) for i in "the woman's going home".split()]
['the', "woman's", 'going', 'home']
>>> [wnl.lemmatize(i) for i in "the women's home is in London".split()]
['the', "women's", 'home', 'is', 'in', 'London']
With tokenization:
>>> [wnl.lemmatize(i) for i in word_tokenize("the woman's going home")]
['the', 'woman', "'s", 'going', 'home']
>>> [wnl.lemmatize(i) for i in word_tokenize("the women's home is in London")]
['the', u'woman', "'s", 'home', 'is', 'in', 'London']
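The same idea carries over to stemming: if the possessive marker is separated from the word before the stemmer sees it, "women" and "women's" reduce to the same input. The following is only a minimal sketch using the standard library, not part of NLTK. The names `strip_possessive` and `normalized_stems` are hypothetical helpers introduced for illustration, and the whitespace split is a stand-in for a real tokenizer such as `word_tokenize`.

```python
import re

# Trailing possessive marker: 's or a bare apostrophe (straight or curly).
# Assumption: any trailing 's is treated as a possessive, so contractions
# such as "it's" are shortened too.
_POSSESSIVE = re.compile(r"(?:['’]s|['’])$", re.IGNORECASE)


def strip_possessive(token):
    """Return the token without a trailing 's or ' (hypothetical helper)."""
    return _POSSESSIVE.sub("", token)


def normalized_stems(text, stem=lambda word: word):
    """Split on whitespace, drop possessive markers, then apply `stem`.

    `stem` defaults to the identity function so the sketch runs without
    NLTK; pass a real stemmer's method to get actual stems.
    """
    return [stem(strip_possessive(token)) for token in text.split()]


print(normalized_stems("women women's"))
```

Passing `PorterStemmer().stem` as the `stem` argument should then yield one stem for both forms, since the stemmer receives "women" in each case. Whether dropping the marker entirely or keeping "'s" as its own token (as `word_tokenize` does above) is preferable depends on the application.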