ValueError: need at least one array to concatenate in Top2Vec Error


docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.', 'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.', 'The healthcare sector should be a key beneficiary of the coronavirus outbreak, on the back of increased demand for healthcare services and drugs.', 'The technology sector should benefit from increased demand for cloud services and hardware demand as China continues to recover from the coronavirus outbreak.', 'China consumer discretionary sector is preferred. In our assessment, the sector is likely to outperform the MSCI China Index in the coming 6-12 months.']

model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

When running the command above, I get an error that is not very helpful for debugging what the root cause might be.

Error:

2021-01-19 05:17:08,541 - top2vec - INFO - Pre-processing documents for training
2021-01-19 05:17:08,562 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-01-19 05:17:13,250 - top2vec - INFO - Creating joint document/word embedding
WARNING:tensorflow:5 out of the last 6 calls triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
2021-01-19 05:17:13,548 - top2vec - INFO - Creating lower dimension embedding of documents
2021-01-19 05:17:15,809 - top2vec - INFO - Finding dense areas of documents
2021-01-19 05:17:15,823 - top2vec - INFO - Finding topics

ValueError                                Traceback (most recent call last)
in <module>()
----> 1 model = Top2Vec(docs, embedding_model='universal-sentence-encoder')

2 frames
<__array_function__ internals> in vstack(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in vstack(tup)
    281
    282
--> 283     return _nx.concatenate(arrs, 0)
    284
    285

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate

You need more documents, with more unique words, for Top2Vec to find at least 2 topics. For example, simply multiplying your list by 10 makes it work:

from top2vec import Top2Vec

docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.',
'The healthcare sector should be a key beneficiary of the coronavirus outbreak, on the back of increased demand for healthcare services and drugs.',
'The technology sector should benefit from increased demand for cloud services and hardware demand as China continues to recover from the coronavirus outbreak.',
'China consumer discretionary sector is preferred. In our assessment, the sector is likely to outperform the MSCI China Index in the coming 6-12 months.']

docs = docs*10 
model = Top2Vec(docs, embedding_model='universal-sentence-encoder')
print(model)

<top2vec.Top2Vec.Top2Vec object at 0x13eef6210>
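
Once training succeeds, you can sanity-check the result with Top2Vec's topic accessors. A minimal sketch (the number of topics and the topic words will depend on your corpus):

print(model.get_num_topics())                      # how many topics were found
topic_words, word_scores, topic_nums = model.get_topics()
print(topic_words[0][:10])                         # top words of the first topic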

I had only a few (30) documents, each up to 130,000 characters long, so I simply split them into smaller documents of 5,000 characters each:


docs_split = []
skip_n = 5000  # chunk size in characters
for doc in docs:
    # slice each long document into consecutive 5,000-character chunks
    for i in range(0, len(doc), skip_n):
        docs_split.append(doc[i:i + skip_n])
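
Chunking the long documents this way should give Top2Vec enough documents to find at least two dense areas. A minimal sketch of then training on the chunks, reusing the same embedding model as above:

model = Top2Vec(docs_split, embedding_model='universal-sentence-encoder')
print(model.get_num_topics())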