How can I decode bytes in a list in Python?
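I'm using Python 2.7.8 and trying to get the origin/root of each word with the stem(param) function of NLTK's ISRIStemmer, but the words in my list are raw bytes (they show up as hex escapes), and I get an error when I run the program.
Here is the code: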
from nltk.stem.isri import ISRIStemmer
st = ISRIStemmer()
f = open("Hassan.txt", "rU")
text = f.read()
text1 = text.split()
# numOfWords is a variable that contains the number of words in the list (text1)
for i in range(1, numOfWords):
    print st.stem(text1[i])
The output is as follows:
Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\stem\isri.py", line 154
    if token in self.stop_words:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Traceback (most recent call last):
  File "C:\Python27\Lib\mycorpus.py", line 81, in <module>
    print st.stem(text1[i])
  File "C:\Python27\lib\site-packages\nltk\stem\isri.py", line 156, in stem
    token = self.pre32(token)  # remove length three and length two prefixes in this order
  File "C:\Python27\lib\site-packages\nltk\stem\isri.py", line 198, in pre32
    if word.startswith(pre3):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc8 in position 0: ordinal not in range(128)
How can I solve this problem?!
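You need to decode the text you read from the file, so that the stemmer receives Unicode strings instead of byte strings. Assuming your file is encoded as UTF-8: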
text=f.read().decode('utf-8')
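As a rough sketch of the same idea, the file can also be decoded while it is being read, using io.open, which exists in both Python 2.7 and Python 3. The helper name read_words and the throwaway demo file are hypothetical and only there for illustration. The encoding is an assumption as well: Arabic text in UTF-8 normally starts with bytes 0xd8 to 0xdb, so the 0xc8 byte in the traceback may mean the file is in a legacy encoding such as Windows-1256, in which case passing encoding="cp1256" would be worth trying.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return the whitespace-separated words of a text file as Unicode strings.

    The default encoding is an assumption; pass e.g. "cp1256" if the file
    turns out to be in a legacy Arabic code page.
    """
    # io.open decodes the bytes while reading, on Python 2.7 and Python 3.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read().split()


# Demo on a throwaway file (stands in for Hassan.txt, whose contents are unknown).
sample = u"\u0628\u064a\u062a \u0643\u062a\u0627\u0628"  # two Arabic words
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

words = read_words(demo_path)
os.remove(demo_path)

# Every item is now a Unicode string, which is what the stemmer compares against.
print(len(words))
```

With the words decoded this way, each one can be passed to st.stem(word) as in the question. Looping with "for word in words:" instead of range(1, numOfWords) would also avoid skipping the first word at index 0.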