IMDb review encoding error

Question

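I am trying to build an RNN model that classifies reviews as positive or negative sentiment.

There is a vocabulary dictionary, and during preprocessing I convert each review into a sequence of indices.
For example,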

"This movie was best" --> [2,5,10,3]
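A minimal sketch of that index mapping, assuming a toy vocabulary. The names `encode_review` and `UNKNOWN_TOKEN` and the five-word vocabulary are made up for illustration; the real vocabulary is built from the corpus, so the actual indices will differ from the ones in the example above.

```python
# Hypothetical toy vocabulary; the real one is built from word frequencies.
index_to_word = ["this", "movie", "was", "best", "UNKNOWN_TOKEN"]
word_to_index = {w: i for i, w in enumerate(index_to_word)}


def encode_review(tokens, word_to_index, unknown_token="UNKNOWN_TOKEN"):
    """Map each token to its index, falling back to the unknown-token index."""
    unk = word_to_index[unknown_token]
    return [word_to_index.get(t, unk) for t in tokens]


# "great" is not in the toy vocabulary, so it maps to the unknown token.
print(encode_review("this movie was great".split(), word_to_index))
```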

When I try to build the list of most frequent words and print its contents, I get this error:

num of reviews 100
number of unique tokens : 4761
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    print(vocab)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)

The code is as follows:

import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token are assumed to be defined earlier in the script (not shown)
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    with open("imdbdata/train/pos/"+item,'r',encoding='utf-8') as f:
        # strip HTML markup and keep only the text
        sample = BeautifulSoup(f.read()).get_text()
    sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))
# count token frequencies across all reviews
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d"%(len(word_freq.items())))
vocab = word_freq.most_common(vocab_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict((w,i) for i,w in enumerate(index_to_word))
print(vocab)

The question is: how do I get rid of this UnicodeEncodeError when working on NLP problems in Python, especially when reading text with the open function?

Answer 1

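Your terminal appears to be configured for ASCII. Because the character '\xe9' is outside the ASCII range (0x00-0x7F), it cannot be printed on an ASCII terminal. It cannot be encoded as ASCII either: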

>>> s = '\xe9'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

You can work around this by explicitly encoding the string when printing, and handling encoding errors by replacing unsupported characters with ?:

>>> print(s.encode('ascii', errors='replace'))
b'?'
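If the `b'...'` wrapper in the output is unwanted, one option is to encode and then decode again, so the result is still a `str`. This is only a sketch: `safe_text` is a made-up helper name, and `backslashreplace` is just one of the standard error handlers you could pick instead of `replace`.

```python
def safe_text(text, encoding="ascii", errors="backslashreplace"):
    """Return text with characters the target encoding cannot represent
    escaped (or replaced), so it can be printed on that terminal."""
    return text.encode(encoding, errors=errors).decode(encoding)


print(safe_text("caf\xe9"))                    # the é is escaped instead of raising
print(safe_text("caf\xe9", errors="replace"))  # the é becomes ?
```

Either way the original character is lost in the output, so this is only suitable for debugging prints, not for data you need to keep.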

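The character is a lowercase e with an acute accent (é), U+00E9; its code point matches its ISO-8859-1 byte value, 0xE9.

You can check which encoding is used for standard output. In my case it is UTF-8, and printing the character works without problems: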

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print('\xe9')
é
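To check up front whether a string will print, you could test it against the reported encoding. A small sketch, assuming a made-up helper name `can_encode`; the fallback to `"ascii"` covers the case where standard output has been replaced by an object that reports no encoding.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The encoding attribute may be missing or None when stdout has been replaced.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
print(stdout_encoding, can_encode("\xe9", stdout_encoding))
```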

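You can force Python to use a different default encoding, and that has been discussed elsewhere, but the best approach is to use a terminal that supports UTF-8.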
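If changing the terminal is not possible, one way to force UTF-8 from inside the script is to wrap the underlying binary stream. The sketch below uses an in-memory buffer so the effect is visible; `utf8_writer` is a made-up name, and in a real script you would presumably pass `sys.stdout.buffer` instead. A terminal that cannot display UTF-8 will still show the bytes incorrectly.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", newline="\n", line_buffering=True
    )


# Demonstrated on an in-memory buffer rather than the real stdout.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("caf\xe9", file=out)
out.flush()

# On Python 3.7+, sys.stdout.reconfigure(encoding="utf-8") has a similar effect,
# as does setting the PYTHONIOENCODING environment variable before starting Python.
```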

Tags: python, nlp, rnn