How to do string semantic matching using gensim in Python?

In Python, how can we determine whether a string is semantically related to a given phrase?

Example:

Our phrase is:

'Fruit and Vegetables'

The strings we want to check for a semantic relationship are:

'I have an apple in my basket', 'I have a car in my house'

Result:

As far as we can tell, the first item, I have an apple in my basket, is related to our phrase.

You can use the gensim library to implement a MatchSemantic function like this:

Initialization


  1. Install gensim and numpy:
pip install numpy
pip install gensim

Code


  1. First, import the requirements:
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
  2. Use this function to check whether strings or sentences match the phrase you want:
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

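    # Pad a single-document input with an empty document,
    # working on a copy so the caller's list is not modified
    if len(documents) == 1: documents = list(documents) + ['']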

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix.
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    return index[query_tf]
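
For intuition about what SoftCosineSimilarity measures, here is a minimal NumPy sketch of the soft cosine measure on a hypothetical two-term vocabulary. The vectors and the term-similarity value 0.5 are made up for illustration and do not come from GloVe; gensim itself works on sparse TF-IDF vectors and a sparse term similarity matrix, so treat this only as a small dense analogue.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine between bag-of-words vectors a and b,
    given a term-similarity matrix S."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S = np.asarray(S, dtype=float)
    # Normalise by the "soft" lengths of both vectors
    norm = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return float(a @ S @ b / norm) if norm > 0 else 0.0

# Hypothetical vocabulary: index 0 = "apple", index 1 = "fruit"
doc = [1.0, 0.0]      # document mentions only "apple"
query = [0.0, 1.0]    # query mentions only "fruit"

identity = [[1.0, 0.0], [0.0, 1.0]]  # terms unrelated: ordinary cosine
related = [[1.0, 0.5], [0.5, 1.0]]   # assumed apple/fruit similarity of 0.5

plain_score = soft_cosine(doc, query, identity)
soft_score = soft_cosine(doc, query, related)
```

With the identity matrix the two texts share no terms, so the score is zero; once the matrix says that "apple" and "fruit" are related, the score becomes positive even though no word is shared. This is why the function above can relate an apple sentence to a fruit phrase.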

Note: the first time you run the code, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that, everything is set up and you can simply run the code.
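
Even after the download, MatchSemantic as written calls api.load on every invocation, which re-opens the vectors from disk each time. One possible way to avoid that is to cache the loaded model in memory. The helper name make_cached_loader below is hypothetical and not part of gensim; this is only a sketch of the idea, and the function would need to be adapted to use it.

```python
from functools import lru_cache

def make_cached_loader(loader):
    """Wrap a loader so each model name is loaded at most once per process."""
    @lru_cache(maxsize=None)
    def load(name):
        return loader(name)
    return load

# Intended use (not executed here, since it triggers the download):
#   import gensim.downloader as api
#   load_vectors = make_cached_loader(api.load)
#   glove = load_vectors("glove-wiki-gigaword-50")
```

Repeated calls with the same model name then return the same in-memory object instead of loading it again.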

Usage


For example, we want to see whether Fruit and Vegetables matches any of the sentences or items in documents.

Test:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple on my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)

The first item, I have an apple on my basket, is semantically related to Fruit and Vegetables, so its score will be 0.189, while the second item is unrelated, so its score will be 0.

Output:

0.189    # I have an apple on my basket
0.000    # I have a car in my house
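
To turn these scores into a yes/no answer to the original question, one option is to apply a cutoff. The helper related_documents and the threshold of 0.1 below are assumptions chosen for illustration, not something gensim provides; a suitable threshold depends on your data. The scores are hard-coded from the output above so the sketch runs without downloading the model.

```python
def related_documents(documents, scores, threshold=0.1):
    """Return (document, score) pairs scoring at least `threshold`, best first."""
    pairs = [(doc, float(score)) for doc, score in zip(documents, scores)]
    matches = [pair for pair in pairs if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

documents = ['I have an apple on my basket', 'I have a car in my house']
scores = [0.189, 0.0]  # scores quoted in the output above

matches = related_documents(documents, scores)
for doc, score in matches:
    print(f"{score:.3f}  {doc}")
```

In practice you would pass the array returned by MatchSemantic(query_string, documents) as scores.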