Can someone explain the unsupported operand error I'm getting while running the file blei_lda.py from Building Machine Learning Systems with Python?

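I have been trying, without success, to run the file blei_lda.py from chapter 4 of the book Building Machine Learning Systems with Python. I am using Python 2.7 with the Enthought Canopy GUI. Below is the actual file as provided by the authors; several copies of it are also available on GitHub.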

github repository

The problem is that I keep getting this error:

TypeError                                 Traceback (most recent call last)
c:\users\matt\desktop\pythonprojects\pml\ch04\blei_lda.py in <module>()
    for ti in range(model.num_topics):
        words = model.show_topic(ti, 64)
------> tf = sum(f for f, w in words)
        with open('topics.txt', 'w') as output:
            output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
            output.write("\n\n\n")

TypeError: unsupported operand type(s) for +: 'int' and 'unicode' 

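I have tried to come up with a workaround, but nothing I found works completely.

I have also searched the web and Stack Overflow for a solution, but I seem to be the only one who has trouble running this file.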

# This code is supporting material for the book
# Building Machine Learning Systems with Python
# by Willi Richert and Luis Pedro Coelho
# published by PACKT Publishing
#
# It is made available under the MIT License

from __future__ import print_function
from wordcloud import create_cloud
try:
    from gensim import corpora, models, matutils
except:
    print("import gensim failed.")
    print()
    print("Please install it")
    raise

import matplotlib.pyplot as plt
import numpy as np
from os import path

NUM_TOPICS = 100

# Check that data exists
if not path.exists('./data/ap/ap.dat'):
    print('Error: Expected data to be present at data/ap/')
    print('Please cd into ./data & run ./download_ap.sh')

# Load the data
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')

# Build the topic model
model = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=None)

# Iterate over all the topics in the model
for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for f, w in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
        output.write("\n\n\n")

# We first identify the most discussed topic, i.e., the one with the
# highest total weight

topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics)
weight = topics.sum(1)
max_topic = weight.argmax()


# Get the top 64 words for this topic
# Without the argument, show_topic would return only 10 words
words = model.show_topic(max_topic, 64)

# This function will actually check for the presence of pytagcloud and is otherwise a no-op
create_cloud('cloud_blei_lda.png', words)

num_topics_used = [len(model[doc]) for doc in corpus]
fig,ax = plt.subplots()
ax.hist(num_topics_used, np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')
fig.tight_layout()
fig.savefig('Figure_04_01.png')


# Now, repeat the same exercise using alpha=1.0
# You can edit the constant below to play around with this parameter
ALPHA = 1.0

model1 = models.ldamodel.LdaModel(
    corpus, num_topics=NUM_TOPICS, id2word=corpus.id2word, alpha=ALPHA)
num_topics_used1 = [len(model1[doc]) for doc in corpus]

fig,ax = plt.subplots()
ax.hist([num_topics_used, num_topics_used1], np.arange(42))
ax.set_ylabel('Nr of documents')
ax.set_xlabel('Nr of topics')

# The coordinates below were fit by trial and error to look good
ax.text(9, 223, r'default alpha')
ax.text(26, 156, 'alpha=1.0')
fig.tight_layout()
fig.savefig('Figure_04_02.png')

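In the line words = model.show_topic(ti, 64), words is a list of (unicode, float64) tuples,

e.g. [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]

So in the line tf = sum(f for f, w in words), f is bound to the unicode word and w to the float value. sum() starts from the integer 0 and tries to add the first word to it, which raises the unsupported operand type error for 'int' and 'unicode'.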
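The failure can be reproduced without gensim. The following is a minimal sketch that uses only the two sample tuples shown above; the variable name error_message is introduced here for illustration. Under Python 2.7 the message names 'unicode', under Python 3 it names 'str'.

```python
# Sample data copied from the example above: (word, weight) pairs.
words = [(u'school', 0.029515796999228502), (u'prom', 0.018586355008452897)]

# Original unpacking order: f is bound to the word, so sum() evaluates
# 0 + u'school' and raises TypeError.
try:
    tf = sum(f for f, w in words)
except TypeError as exc:
    error_message = str(exc)  # hypothetical name, kept only for inspection
    print(error_message)

# Swapped unpacking order: f is now bound to the float weight.
tf = sum(f for w, f in words)
print(tf)
```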

Change this line to tf = sum(f for w, f in words), so that it sums the float values instead.

Change this line in the same way: output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for w, f in words))

The code snippet will then look like this:

for ti in range(model.num_topics):
    words = model.show_topic(ti, 64)
    tf = sum(f for w, f in words)
    with open('topics.txt', 'w') as output:
        output.write('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for w, f in words))
        output.write("\n\n\n")
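Two further points may be worth checking. First, opening topics.txt with mode 'w' inside the loop truncates the file on every iteration, so only the last topic would remain in it. Second, as far as I know, the order of the pairs returned by show_topic differs between gensim releases, (weight, word) in older ones and (word, weight) in newer ones; treat that as an assumption and check your installed version. Below is a hedged sketch that handles both orders and writes every topic through a single file handle. The helpers as_word_weight and write_topics and the sample data are hypothetical and not part of the book's code or of gensim.

```python
import io
import numbers


def as_word_weight(pair):
    """Return (word, weight) regardless of the order of the input pair."""
    a, b = pair
    if isinstance(a, numbers.Real):  # pair arrived as (weight, word)
        return b, a
    return a, b


def write_topics(topics, output):
    """Write each topic as word:score lines, scores scaled to sum to about 1000."""
    for words in topics:
        pairs = [as_word_weight(p) for p in words]
        tf = sum(f for w, f in pairs)
        output.write(u'\n'.join(
            u'{}:{}'.format(w, int(1000. * f / tf)) for w, f in pairs))
        output.write(u"\n\n\n")


# Made-up sample topics, one in each pair order, standing in for
# the lists that model.show_topic(ti, 64) would return.
sample_topics = [
    [(u'school', 0.75), (u'prom', 0.25)],
    [(0.5, u'court'), (0.5, u'judge')],
]
buf = io.StringIO()
write_topics(sample_topics, buf)
print(buf.getvalue())
```

With the real model, the same function could be called as write_topics((model.show_topic(ti, 64) for ti in range(model.num_topics)), output) inside a single with io.open('topics.txt', 'w', encoding='utf-8') as output: block, so the file is opened once and all topics are kept.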