Gensim: word vectors encoding problems
After creating word vectors in Gensim 2.2.0 from plain English text files containing IMDB movie ratings:
import gensim, logging
import smart_open, os
from nltk.tokenize import RegexpTokenizer

VEC_SIZE = 300
MIN_COUNT = 5
WORKERS = 4
data_path = './data/'
vectors_path = 'vectors.bin.gz'

class AllSentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
        self.read_err_cnt = 0
        self.tokenizer = RegexpTokenizer('[\'a-zA-Z]+', discard_empty=True)

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            print(fname)
            for line in open(os.path.join(self.dirname, fname)):
                words = []
                try:
                    for word in self.tokenizer.tokenize(line):
                        words.append(word)
                    yield words
                except:
                    self.read_err_cnt += 1

sentences = AllSentences(data_path)
Training and saving the model:
model = gensim.models.Word2Vec(sentences, size=VEC_SIZE,
                               min_count=MIN_COUNT, workers=WORKERS)
word_vectors = model.wv
word_vectors.save(vectors_path)
And then trying to load it:
from gensim.models.keyedvectors import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(vectors_path,
                                            binary=True,
                                            unicode_errors='ignore')
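I get a 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0' exception (see below). I tried different combinations of the 'encoding' parameter, including 'ISO-8859-1' and 'Latin1', and different combinations of 'binary=True/False'. Nothing helps: the same exception is raised no matter which parameters are used. What is wrong? How can I make loading the vectors work?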
Exception:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-64-f353fa49685c> in <module>()
----> 1 w2v = get_w2v_vectors()
<ipython-input-63-cbbe0a76e837> in get_w2v_vectors()
3 vectors = KeyedVectors.load_word2vec_format(word_vectors_path,
4 binary=True,
----> 5 unicode_errors='ignore')
6
7 #unicode_errors='ignore')
D:\usr\anaconda\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
204 logger.info("loading projection weights from %s", fname)
205 with utils.smart_open(fname) as fin:
--> 206 header = utils.to_unicode(fin.readline(), encoding=encoding)
207 vocab_size, vector_size = map(int, header.split()) # throws for invalid file format
208 if limit:
D:\usr\anaconda\lib\site-packages\gensim\utils.py in any2unicode(text, encoding, errors)
233 if isinstance(text, unicode):
234 return text
--> 235 return unicode(text, encoding, errors=errors)
236 to_unicode = any2unicode
237
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
If you save the vectors with gensim's native save() method, you should load them with the native load() method.
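A likely explanation for the 0x80 byte, assuming that the native save() pickles the object with protocol 2 or higher (an assumption about gensim internals, not something stated in the question): every such pickle stream begins with byte 0x80, which can never be the first byte of UTF-8 text. The sketch below uses only the standard library, and `payload` is a hypothetical stand-in for the saved vectors.

```python
import pickle

# Hypothetical stand-in for the object that save() writes to disk;
# a real KeyedVectors instance is much larger.
payload = {"vocab": ["good", "bad"], "vector_size": 300}

# Pickle protocol 2 and above start the stream with the PROTO opcode, byte 0x80.
blob = pickle.dumps(payload, protocol=2)
first_byte = blob[0]

# A word2vec-format loader starts by decoding a text header line,
# so feeding it pickled bytes fails on the very first byte.
try:
    blob.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(hex(first_byte))
print(error_message)
```

If that assumption holds, the file is simply in a different format from the one the loader expects, so changing the text encoding passed to load_word2vec_format() cannot turn it into a valid word2vec file.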
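If you want to load the vectors with load_word2vec_format(), you need to save them with save_word2vec_format(). (You will lose some information that way, such as the exact occurrence counts stored in the KeyedVectors.vocab dictionary entries.)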
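To make the expected file layout concrete, here is a hand-rolled sketch of the plain-text word2vec format: a header line with the vocabulary size and vector size, then one line per word. It is not gensim's implementation; `write_word2vec_text`, `read_word2vec_text`, and `toy_vectors` are hypothetical names used only for illustration. The header parsing mirrors the `map(int, header.split())` line visible in the traceback.

```python
import io

def write_word2vec_text(vectors, stream):
    """Write {word: [floats]} in the plain-text word2vec layout."""
    vector_size = len(next(iter(vectors.values())))
    # Header line: "<vocab_size> <vector_size>"
    stream.write("%d %d\n" % (len(vectors), vector_size))
    # One line per word: the word followed by its vector components.
    for word, values in vectors.items():
        stream.write(word + " " + " ".join(repr(v) for v in values) + "\n")

def read_word2vec_text(stream):
    """Parse the layout written above back into {word: [floats]}."""
    vocab_size, vector_size = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    return vectors

toy_vectors = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
header = buffer.getvalue().splitlines()[0]
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(header)
```

The layout holds only words and their vectors, which is consistent with the loss of occurrence counts mentioned above. With gensim itself, the matching pairs would be word_vectors.save(path) with KeyedVectors.load(path), or word_vectors.save_word2vec_format(path, binary=True) with KeyedVectors.load_word2vec_format(path, binary=True).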