How to get maximum similarity value between lists with numpy?

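I have two lists, and the idea is to compare each element of one list against all elements of the second list to extract the element with the greatest similarity, much like a search engine.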

Variables used, with NLU:

import numpy as np
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

match = ['sentence_to_1',
 'sentence_to_2',
 'sentence_to_3',
  ...]

match2 = ['sentence_from_1',
 'sentence_from_2',
 'sentence_from_3',
 'sentence_from_1',
  ...]

pipe = nlu.load('xx.embed_sentence.bert_use_cmlm_multi_base_br')
df = pd.DataFrame({'one': match, 'two': match2})
predictions_1 = pipe.predict(df.one, output_level='document')
predictions_2 = pipe.predict(df.two, output_level='document')
e_col = 'sentence_embedding_bert_use_cmlm_multi_base_br'

predictions_1
output: 

  document          sentence_embedding_bert_use_cmlm_multi_base_br
0 sentence_to_1     [0.018291207030415535, -0.05946089327335358, -...
1 sentence_to_2     [0.04855785518884659, 0.09505678713321686, 0.3...
2 sentence_to_3     [0.15838183462619781, -0.19057893753051758, -0...

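I have already used the approach below to compare each element of one list against all elements of the other. I would also greatly appreciate an idea that is not too expensive and avoids loops and list comprehensions. For example: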

embed_mat = np.array([x for x in predictions_1[e_col]])
for i in match2:
  embedding = pipe.predict(i).iloc[0][e_col]
  m = np.array([embedding,]*len(df))
  sim_mat = cosine_similarity(m,embed_mat)
  print(sim_mat[0])
output:

[0.66812827 0.60055647 0.7160895  0.730334   0.76885804 0.54169453
 0.61199156 0.6578508  0.68869315 0.71536224 0.64135093 0.68568607
 0.7026179  0.64319338 0.60390899 0.64774842 0.62665297 0.61611091
 0.62738365 0.60333599 0.61464704 0.68141089 0.75263237 0.77213446
 0.75132462]
[0.72350056 0.65223669 0.67931278 0.62036637 0.67934842 0.62129368
 0.69825526 0.55635858 0.62417926 0.57909757 0.58463102 0.75053411
 0.62435311 0.66574652 0.6980762  0.72050293 0.64668413 0.62632569
 0.63648157 0.59476883 0.66401519 0.68794243 0.64723412 0.68215344
 0.66456176]
[0.84471557 0.75666135 0.75268174 0.71671225 0.74120815 0.78075131
 0.75810087 0.67278428 0.72912575 0.70120557 0.70225784 0.78829443
 0.70072031 0.76282867 0.78521151 0.76517436 0.7233746  0.71423372
 0.69281594 0.71363751 0.73811129 0.7231086  0.73386457 0.76077197
 0.75507266]
...

Each printed array holds the similarities between one sentence from match2 and all the sentences in match.

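My goal is to end up with a final frame like this: for each element I search for from one list, I find the element with the highest similarity in the other list.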

  element_from       element_to       similarity
0 sentence_from_1    sentence_to_5    0.95424...
1 sentence_from_3    sentence_to_10   0.93333...
2 sentence_from_11   sentence_to_12   0.55112...

I even got the result with this approach:

embed_mat = np.array([x for x in predictions_1[e_col]])
to = []
fro = []
sim = []
for i in match2:
  fro.append(i)
  embedding = pipe.predict(i).iloc[0][e_col]
  m = np.array([embedding,]*len(df))
  sim_mat = cosine_similarity(m,embed_mat)
  sim.append(max(sim_mat[0]))
  to.append(predictions_1['document'].values[sim_mat[0].argmax()])

pd.DataFrame({'From': fro, 'To': to, 'Similarity': sim})

But I believe there is a better way to solve it. By better, I mean more optimized.
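One possible way to drop the per-sentence loop is to compute the whole similarity matrix in one step and take the row-wise maximum. The sketch below uses plain numpy on toy vectors; `best_matches`, `emb_from`, `emb_to` and the toy data are illustrative names, not part of the original code. In the setup above, `emb_to` would presumably be `embed_mat`, and `emb_from` could be built the same way from `predictions_2[e_col]`, assuming both prediction frames keep the input order. sklearn's `cosine_similarity(X, Y)` also accepts two matrices and returns the full pairwise matrix, so it could replace the normalisation and matrix product here.

```python
import numpy as np

def best_matches(emb_from, emb_to):
    """For each row of emb_from, return the index and cosine similarity
    of the most similar row of emb_to (hypothetical helper)."""
    emb_from = np.asarray(emb_from, dtype=float)
    emb_to = np.asarray(emb_to, dtype=float)
    # L2-normalise the rows so a dot product equals the cosine similarity.
    # Assumes no all-zero embedding, which would divide by zero.
    from_unit = emb_from / np.linalg.norm(emb_from, axis=1, keepdims=True)
    to_unit = emb_to / np.linalg.norm(emb_to, axis=1, keepdims=True)
    sim = from_unit @ to_unit.T               # shape (n_from, n_to)
    best_idx = sim.argmax(axis=1)             # best column for every row
    best_sim = sim[np.arange(sim.shape[0]), best_idx]
    return best_idx, best_sim

# Toy stand-ins for the real sentence embeddings.
names_to = ['sentence_to_1', 'sentence_to_2', 'sentence_to_3']
names_from = ['sentence_from_1', 'sentence_from_2']
emb_to = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
emb_from = np.array([[2.0, 0.1], [0.5, 0.5]])

idx, sim = best_matches(emb_from, emb_to)
for name, j, s in zip(names_from, idx, sim):
    print(name, '->', names_to[j], round(float(s), 5))
```

The three returned columns (source name, `names_to[j]`, similarity) map directly onto the `From`, `To` and `Similarity` columns of the desired frame. This also embeds every sentence once instead of calling `pipe.predict` inside the loop.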

An alternative solution that gives something similar:

# Cosine similarity between two 1-D vectors
# (note: this definition shadows the cosine_similarity imported from sklearn above)
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2))) 

for i in range(embed_mat.shape[0]):
    for j in range(i + 1, embed_mat.shape[0]):
        # embed_mat is already a dense ndarray here, so no .toarray() call is needed
        print("The cosine similarity between the documents ", i, "and", j, "is: ",
              cosine_similarity(embed_mat[i], embed_mat[j]))
Output:
The cosine similarity between the documents sentence_from_1 and sentence_to_5 is   0.95424
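The nested loop above visits every pair i < j within a single embedding matrix. As a rough sketch, the same pairs can be read out of one matrix product with `np.triu_indices`; `pairwise_upper` and the toy matrix are hypothetical names used only for illustration.

```python
import numpy as np

def pairwise_upper(emb):
    """Cosine similarity for every pair i < j of rows in emb (hypothetical helper)."""
    emb = np.asarray(emb, dtype=float)
    # Normalise rows once; assumes no all-zero row.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T                        # full (n, n) similarity matrix
    i, j = np.triu_indices(len(emb), k=1)      # indices strictly above the diagonal
    return i, j, sim[i, j]

# Toy stand-in for embed_mat.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rows, cols, sims = pairwise_upper(toy)
for i, j, s in zip(rows, cols, sims):
    print(f"The cosine similarity between the documents {i} and {j} is: {s:.5f}")
```

The pairs come out in the same order as the nested loop. Note that this compares sentences within one list; matching one list against another needs the two-matrix product instead.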