How to do string semantic matching using gensim in Python?
In Python, how can we determine whether a string has a semantic relation to a given phrase?

Example:

Our phrase is:

'Fruit and Vegetables'

The strings we want to check for a semantic relation are:

'I have an apple in my basket', 'I have a car in my house'

Result:

As we know, the first item, I have an apple in my basket, is related to our phrase.
You can use the gensim library to implement MatchSemantic and write the whole thing as a single function (see the full code here):
Initialization

- Install gensim and numpy:

pip install numpy
pip install gensim
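As a quick sanity check (my addition, not part of the original answer), you can confirm that both packages import cleanly and print their versions; the snippet assumes any recent gensim release that still ships gensim.similarities:

import gensim
import numpy

# Print the installed versions to confirm the environment is ready.
print('gensim', gensim.__version__)
print('numpy', numpy.__version__)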
Code

- First, import the requirements:
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
- Use this function to check whether the strings and sentences match the phrase you want:
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    # Pad a single-document input with an empty document so the
    # similarity index is built over at least two documents.
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize and clean up the input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary and the TF-IDF model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    # Compute the soft cosine similarity between the query and every document
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    return index[query_tf]
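One practical refinement (my assumption, not part of the original answer): api.load("glove-wiki-gigaword-50") downloads the model on first use and reloads it from disk on every call, so if you call MatchSemantic repeatedly, a minimal sketch like this caches the loaded vectors at module level:

import gensim.downloader as api

_glove_cache = None  # hypothetical module-level cache, not in the original code

def load_glove():
    # Load the pretrained GloVe vectors once and reuse them on later calls.
    global _glove_cache
    if _glove_cache is None:
        _glove_cache = api.load("glove-wiki-gigaword-50")
    return _glove_cache

Inside MatchSemantic you would then replace the api.load(...) line with glove = load_glove().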
Note: if you run the code for the first time, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that, everything is set up and you can simply run the code.
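If you prefer to fetch the model ahead of time rather than during the first call, gensim's downloader can also be run from the command line (an optional step, using the gensim.downloader module CLI):

python -m gensim.downloader --download glove-wiki-gigaword-50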
Usage

For example, we want to see whether Fruit and Vegetables matches any of the sentences or items in documents.

Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)
As a result, we can see that the first item, I have an apple in my basket, has a semantic relation to Fruit and Vegetables, so its score will be 0.189, while the second item has no relation, so its score will be 0.

Output:
0.189 # I have an apple in my basket
0.000 # I have a car in my house
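To turn these raw scores into a yes/no decision, a minimal follow-up sketch could filter on a threshold; the 0.1 cutoff below is an assumption chosen for illustration, not a value from the answer:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']

scores = MatchSemantic(query_string, documents)

THRESHOLD = 0.1  # hypothetical cutoff for this example

# Keep only the documents whose soft cosine score clears the cutoff.
matches = [doc for doc, score in zip(documents, scores) if score > THRESHOLD]
print(matches)  # ['I have an apple in my basket']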