Getting an error while using a gensim model in Python
I used gensim to build a doc2vec file from my training data, and I get an error when I try to process it.
I am running the code below:
import sys

import numpy
from gensim.models import Doc2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

model = Doc2Vec.load('sentiment140.d2v')

if len(sys.argv) < 4:
    print("Please input train_pos_count, train_neg_count and classifier!")
    sys.exit()

train_pos_count = int(sys.argv[1])
train_neg_count = int(sys.argv[2])
test_pos_count = 144
test_neg_count = 144
print(train_pos_count)
print(train_neg_count)
vec_dim = 100

print("Build training data set...")
train_arrays = numpy.zeros((train_pos_count + train_neg_count, vec_dim))
train_labels = numpy.zeros(train_pos_count + train_neg_count)
# Positive examples fill the first rows, negative examples the rest.
for i in range(train_pos_count):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    train_arrays[i] = model.docvecs[prefix_train_pos]
    train_labels[i] = 1
for i in range(train_neg_count):
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[train_pos_count + i] = model.docvecs[prefix_train_neg]
    train_labels[train_pos_count + i] = 0

print("Build testing data set...")
test_arrays = numpy.zeros((test_pos_count + test_neg_count, vec_dim))
test_labels = numpy.zeros(test_pos_count + test_neg_count)
for i in range(test_pos_count):
    prefix_test_pos = 'TEST_POS_' + str(i)
    test_arrays[i] = model.docvecs[prefix_test_pos]
    test_labels[i] = 1
for i in range(test_neg_count):
    prefix_test_neg = 'TEST_NEG_' + str(i)
    test_arrays[test_pos_count + i] = model.docvecs[prefix_test_neg]
    test_labels[test_pos_count + i] = 0

print("Begin classification...")
classifier = None
if sys.argv[3] == '-lr':
    print("Logistic Regression is used...")
    classifier = LogisticRegression()
elif sys.argv[3] == '-svm':
    print("Support Vector Machine is used...")
    classifier = SVC()
elif sys.argv[3] == '-knn':
    print("K-Nearest Neighbors is used...")
    classifier = KNeighborsClassifier(n_neighbors=10)
elif sys.argv[3] == '-rf':
    print("Random Forest is used...")
    classifier = RandomForestClassifier()
else:
    # Without this branch, classifier stays None and fit() below fails.
    sys.exit("Unknown classifier flag: " + sys.argv[3])

classifier.fit(train_arrays, train_labels)
print("Accuracy:", classifier.score(test_arrays, test_labels))
I am getting a KeyError: "TEST_POS_72".
I would like to know what I am doing wrong.
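The error literally means that the model contains no doc-vector with the key ('tag') TEST_POS_72. No document with that tag can have been presented during training.
You can see the list of all doc-tags known to the model in model.docvecs.offset2doctag. If TEST_POS_72 is not in that list, you cannot access a doc-vector via model.docvecs['TEST_POS_72']. (If the list is empty, the doc-vectors were trained to be accessed by plain int keys, and model.docvecs[72] would be the more appropriate way to access a doc-vector.)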
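As a rough sketch of that check, the snippet below compares the tags the script expects against the tags the model knows, so the missing ones show up before any lookup raises KeyError. The helper names find_missing_tags and expected_tags are hypothetical, and known_tags is made-up stand-in data so the example runs without gensim; with the real model you would presumably pass list(model.docvecs.offset2doctag) instead.

```python
def find_missing_tags(known_tags, wanted_tags):
    """Return the wanted tags that are absent from known_tags, in order."""
    known = set(known_tags)
    return [tag for tag in wanted_tags if tag not in known]


def expected_tags(prefix, count):
    """Build tags such as TEST_POS_0 ... TEST_POS_<count - 1>."""
    return [prefix + str(i) for i in range(count)]


# Stand-in for list(model.docvecs.offset2doctag). Assumption for illustration:
# only 72 positive test documents were tagged during training.
known_tags = expected_tags('TEST_POS_', 72) + expected_tags('TEST_NEG_', 144)

# The tags the classification script will try to look up.
wanted = expected_tags('TEST_POS_', 144) + expected_tags('TEST_NEG_', 144)

missing = find_missing_tags(known_tags, wanted)
print(len(missing), missing[0])  # prints: 72 TEST_POS_72
```

If the missing list is non-empty, either lower test_pos_count / test_neg_count to match what was actually trained, or retrain the model with the full set of tagged test documents.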
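(Separately, Doc2Vec does not work well on tiny corpora of only a few hundred documents, and the warning in your screenshot, "Slow version of gensim.models.doc2vec is being used", means gensim's optimized C-compiled routines were not part of the install, so training will be 100x or more slower.)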