Text similarity using WMD within the same time period
I have a dataset:
Title Year
0 Sport, there will be a match between United and Tottenham ... 2020
1 Forecasting says that it will be cold next week 2019
2 Sport, Mourinho is approaching the anniversary at Tottenham 2020
3 Sport, Tottenham are sixth favourites for the title behind Arsenal. 2020
4 Pochettino says clear-out of fringe players at Tottenham is inevitable. 2018
... ... ...
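I would like to study text similarity within the same year, rather than across the whole dataset. To find the most similar texts, I am using Word Mover's Distance (WMD).
For two texts it would be: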
import gensim

# Load the pretrained Google News vectors, then compare two tokenized strings
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
distance = word2vec_model.wmdistance("string 1".split(), "string 2".split())
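However, I need to iterate the distance over the sentences from the same year, to get the similarity of each text with the others and build a list of similar texts for each row of the dataframe.
Could you tell me how to iterate the wmdistance function across texts published in the same year, so that each text gets the most similar texts from the same period?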
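Generating a distance matrix for each group and then picking the minimum should work. This gives you the index of the single nearest document within a given year. If you want n documents or something similar, you should be able to modify this code fairly easily.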
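import numpy as np
from scipy.spatial.distance import pdist, squareform

def nearest_doc(group):
    # Pairwise WMD between the titles of one year; wmdistance expects token lists, so split each title
    sq = squareform(pdist(group.to_numpy()[:, None], metric=lambda x, y: word2vec_model.wmdistance(x[0].split(), y[0].split())))
    # Mask zero distances (the diagonal and exact duplicates) before taking the row-wise minimum
    return group.index.to_numpy()[np.argmin(np.where(sq == 0, np.inf, sq), axis=1)]

df['nearest_doc'] = df.groupby('Year')['Title'].transform(nearest_doc)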
Result (a title that is the only one in its year is matched to itself):
Title Year nearest_doc
0 Sport, there will be a match between United an... 2020 3
1 Forecasting says that it will be cold next week 2019 1
2 Sport, Mourinho is approaching the anniversary... 2020 3
3 Sport, Tottenham are sixth favourites for the ... 2020 2
4 Pochettino says clear-out of fringe players at... 2018 4
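For the "n documents" case mentioned above, here is a minimal sketch of one way to do it, not the only one. It assumes the same `Title` and `Year` columns as the question. The names `top_n_nearest`, `nearest_within_year` and `jaccard_distance` are made up for illustration, and `jaccard_distance` is only a stand-in metric so the code can be tried without loading the word2vec model. Any callable that takes two token lists can be passed instead.

```python
import numpy as np
import pandas as pd


def jaccard_distance(tokens_a, tokens_b):
    """Hypothetical stand-in metric, used only to try the code without the word2vec model."""
    a, b = set(tokens_a), set(tokens_b)
    return 1.0 - len(a & b) / len(a | b)


def top_n_nearest(group, distance, n=2):
    """For each title in `group`, list the index labels of its n closest titles."""
    tokens = [title.split() for title in group]
    size = len(tokens)
    # Symmetric distance matrix; the diagonal stays infinite so a title
    # is never reported as its own neighbour.
    dist = np.full((size, size), np.inf)
    for i in range(size):
        for j in range(i + 1, size):
            dist[i, j] = dist[j, i] = distance(tokens[i], tokens[j])
    labels = group.index.tolist()
    nearest = []
    for row in dist:
        order = np.argsort(row, kind="stable")[:n]
        # Drop infinite entries (self, or pairs the metric could not compare).
        nearest.append([labels[k] for k in order if np.isfinite(row[k])])
    return pd.Series(nearest, index=group.index)


def nearest_within_year(df, distance, n=2):
    """Apply top_n_nearest to the titles of each year and restore the original row order."""
    parts = [top_n_nearest(group, distance, n) for _, group in df.groupby("Year")["Title"]]
    return pd.concat(parts).reindex(df.index)
```

With the model from the question loaded, something like `df['nearest_docs'] = nearest_within_year(df, word2vec_model.wmdistance, n=3)` should give each row a list of up to three index labels from the same year, closest first. A title that is alone in its year gets an empty list instead of pointing at itself.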