How to use the NLTK Snowball stemmer to stem a list of Spanish words in Python

Question

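I'm trying to use the NLTK Snowball stemmer to stem Spanish text, and I'm running into encoding problems that I don't understand.

Here is the example sentence I'm working with: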

En diciembre, los precios de la energía subieron un 1,4 por ciento, los de la vivienda aumentaron un 0,1 por ciento y los precios de la vestimenta se mantuvieron sin cambios, mientras que los de los automóviles nuevos bajaron un 0,1 por ciento y los de los pasajes de avión cayeron el 0,7 por ciento.

First, I read the sentence from an XML file with the following code:

from nltk.stem.snowball import SnowballStemmer
import xml.etree.ElementTree as ET

stemmer = SnowballStemmer("spanish")
sentence = ET.tostring(context, encoding='utf-8', method="text").lower()

Then, after tokenizing the sentence into a list of words, I try to stem each word:

stem = stemmer.stem(words[headIndex - index])

The error comes from this line:

Traceback (most recent call last):
  File "main.py", line 150, in <module>
    main()
  File "main.py", line 142, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 86, in englishXml
    stem = stemmer.stem(words[headIndex - index])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 3404, in stem
    r1, r2 = self._r1r2_standard(word, self.__vowels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 232, in _r1r2_standard
    if word[i] not in vowels and word[i-1] in vowels:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I also tried reading the sentence from the XML file without the "utf-8" encoding, but in that case the line with ".lower()" fails:

sentence = ET.tostring(context, method="text").lower()

The error then becomes:

Traceback (most recent call last):
  File "main.py", line 154, in <module>
    main()
  File "main.py", line 146, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 63, in englishXml
    sentence = ET.tostring(context, method="text").lower()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 814, in write
    _serialize_text(write, self._root, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1006, in _serialize_text
    write(part.encode(encoding))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 18: ordinal not in range(128)

Thanks in advance!

Answer 1

Try adding this before stemming:

sentence = sentence.decode('utf8')
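
Why this likely helps: the first traceback suggests the stemmer received a UTF-8 byte string (note the byte 0xc3) and Python 2 then tried an implicit ASCII decode when comparing it with its vowel table. Decoding first hands the stemmer real text. A minimal sketch of the difference, where the sample word is only a stand-in for the output of ET.tostring(..., encoding='utf-8'):

```python
# -*- coding: utf-8 -*-
# Sketch only: `raw` stands in for the byte string returned by
# ET.tostring(..., encoding='utf-8'); the sample word is an assumption.
raw = u"energía".encode("utf-8")   # bytes: the accented 'í' takes two bytes (0xc3 0xad)
text = raw.decode("utf-8")         # text: the accented 'í' is a single character

byte_length = len(raw)    # counts bytes
char_length = len(text)   # counts characters
```

The stemmer works character by character, so it should be given the decoded form.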

Answer 2

Just to confirm, the final code is:

from nltk.stem.snowball import SnowballStemmer
import xml.etree.ElementTree as ET

stemmer = SnowballStemmer("spanish")

sentence = ET.tostring(context, encoding='utf-8', method="text").lower()
sentence = sentence.decode('utf8')
stem = stemmer.stem(words[headIndex - index])
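
For readers on Python 3 (an assumption; the tracebacks in the question are from Python 2.7), the separate decode step can probably be avoided, because ET.tostring accepts encoding="unicode" and then returns text rather than bytes. A rough sketch, where the inline XML snippet, its element names, and the whitespace tokenizer are all hypothetical stand-ins for the asker's real data:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real XML file; element names are assumptions.
xml_source = "<context>Los precios de la <head>energía</head> subieron.</context>"
context = ET.fromstring(xml_source)

# In Python 3, encoding="unicode" makes tostring return str instead of bytes,
# so .lower() operates on characters and no decode() call is needed.
sentence = ET.tostring(context, encoding="unicode", method="text").lower()

# Whitespace split as a simple placeholder for the real tokenizer.
words = sentence.split()
```

Each item in words is then already text and can be passed to stemmer.stem directly.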

Answer 3

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('spanish')
stemmed_spanish = [stemmer.stem(item) for item in spanish_words]
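
The snippet above assumes spanish_words already exists. A self-contained sketch of the same idea, assuming Python 3 with NLTK installed and using a few words from the question's example sentence as the (otherwise hypothetical) input list:

```python
from nltk.stem.snowball import SnowballStemmer

# Sample input taken from the question's example sentence; in real use this
# list would come from the tokenizer.
spanish_words = ["precios", "energía", "subieron", "aumentaron"]

stemmer = SnowballStemmer("spanish")

# Stem every word; each item must be text (unicode), not UTF-8 bytes.
stemmed_spanish = [stemmer.stem(item) for item in spanish_words]

# Pair each word with its stem for inspection.
pairs = list(zip(spanish_words, stemmed_spanish))
```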

python

nltk