Understanding np.zeros in clustering

Question

I'm learning clustering, and in several tutorials I've seen a similarity-measure step that I don't quite understand:

tfidf_vector = TfidfVectorizer()
tfidf_matrix = tfidf_vector.fit_transform(file)

#and/or

count_vector = CountVectorizer()
count_matrix = count_vector.fit_transform(file)

#AND HERE
file_size = len(file)
x = np.zeros((file_size, file_size))
#and here the similarity measures like cosine_similarity, jaccard...

for elm in range(file_size):
    x[elm] = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix)

y = np.subtract(np.ones((file_size, file_size), dtype=float), x)

new_file = np.asarray(y)
w = new_file.reshape((1,file_size,file_size))

Why do we need np.zeros? Aren't tfidf_matrix/count_matrix enough on their own for the similarity measures?

Answer 1

This code does the same thing (I changed i to elm because it looked like a typo):

x = []
for elm in range(file_size):
    # Each call returns a (1, file_size) array; take row 0 so x ends up 2-D
    x.append(cosine_similarity(tfidf_matrix[elm:elm+1], tfidf_matrix)[0])
x = np.asarray(x)

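You could also replace np.zeros with np.empty, since every row is overwritten in the loop anyway. Creating the array up front and then filling in each element is more efficient than appending to a list and converting the list to a NumPy array afterwards. Many other programming languages require arrays to be preallocated, as NumPy does, which is why many people choose to fill arrays this way.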
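As a rough illustration of the point above, the sketch below fills the same similarity matrix three ways: into an np.zeros array, into an np.empty array, and via a list. It uses only NumPy so it runs on its own; cosine_rows is a hypothetical stand-in for scikit-learn's cosine_similarity, and docs is made-up data standing in for tfidf_matrix.

```python
import numpy as np

# Made-up dense stand-in for tfidf_matrix: 3 documents, 4 terms.
docs = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

def cosine_rows(a, b):
    """Hypothetical helper: cosine similarity of each row of a with each row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

n = docs.shape[0]

# 1) Preallocate with zeros, then overwrite every row.
x_zeros = np.zeros((n, n))
for elm in range(n):
    x_zeros[elm] = cosine_rows(docs[elm:elm + 1], docs)[0]

# 2) np.empty skips the zero fill; safe only because every row is overwritten.
x_empty = np.empty((n, n))
for elm in range(n):
    x_empty[elm] = cosine_rows(docs[elm:elm + 1], docs)[0]

# 3) Append rows to a list and convert once at the end.
rows = []
for elm in range(n):
    rows.append(cosine_rows(docs[elm:elm + 1], docs)[0])
x_list = np.asarray(rows)

# All three approaches should produce the same (n, n) matrix.
same = np.allclose(x_zeros, x_empty) and np.allclose(x_zeros, x_list)
print(same)
```

The zeros themselves are never used in any of these; np.zeros only reserves an array of the right shape for the loop to fill.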

However, since this is Python, you should write it whichever way you find easiest to read.
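If readability is the goal, one option is to drop the row-by-row loop entirely. scikit-learn's cosine_similarity, when called with a single matrix, returns all pairwise similarities at once, so no preallocated array is needed. The sketch below shows the same idea with NumPy only; docs is made-up data standing in for tfidf_matrix, and the variable names are assumptions for illustration.

```python
import numpy as np

# Made-up dense stand-in for tfidf_matrix: 3 documents, 4 terms.
docs = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Scale every row to unit length; the dot products are then cosine similarities.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# Full (n, n) similarity matrix in one step, with no loop and no np.zeros.
sim = unit @ unit.T

# Cosine distance, the counterpart of y = 1 - x in the question.
dist = 1.0 - sim

print(dist.shape)
```

With scikit-learn installed, cosine_similarity(tfidf_matrix) plays the role of sim here, and 1 - cosine_similarity(tfidf_matrix) plays the role of dist.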

Tags: python, cluster-analysis, vector, similarity, python-3.x