How to do string semantic matching using gensim in Python?
In Python, how can we determine whether a string has a semantic relation to a given phrase?

Example:

Our phrase is:

'Fruit and Vegetables'

The strings we want to check for a semantic relation are:

'I have an apple in my basket', 'I have a car in my house'

Result:

As we know, the first item, I have an apple in my basket, is related to our phrase.
You can use the gensim library to implement MatchSemantic and write the whole thing as a single function (see the full code here):
Initialization

- Install gensim and numpy:

pip install numpy
pip install gensim
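As a quick sanity check (my addition, not part of the original answer), you can confirm that both packages import cleanly and print their versions; the snippet assumes any recent gensim release that still ships gensim.similarities:

import gensim
import numpy

# Print the installed versions to confirm the environment is ready.
print('gensim', gensim.__version__)
print('numpy', numpy.__version__)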
Code

- First, import the requirements:
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
- Use this function to check whether the strings and sentences match the phrase you want:
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    # Pad a single-document input with an empty document so the
    # similarity index is built over at least two documents.
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize and clean up the input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary and the TF-IDF model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    # Compute the soft cosine similarity between the query and every document
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    return index[query_tf]
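One practical refinement (my assumption, not part of the original answer): api.load("glove-wiki-gigaword-50") downloads the model on first use and reloads it from disk on every call, so if you call MatchSemantic repeatedly, a minimal sketch like this caches the loaded vectors at module level:

import gensim.downloader as api

_glove_cache = None  # hypothetical module-level cache, not in the original code

def load_glove():
    # Load the pretrained GloVe vectors once and reuse them on later calls.
    global _glove_cache
    if _glove_cache is None:
        _glove_cache = api.load("glove-wiki-gigaword-50")
    return _glove_cache

Inside MatchSemantic you would then replace the api.load(...) line with glove = load_glove().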
Note: if you run the code for the first time, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that, everything is set up and you can simply run the code.
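If you prefer to fetch the model ahead of time rather than during the first call, gensim's downloader can also be run from the command line (an optional step, using the gensim.downloader module CLI):

python -m gensim.downloader --download glove-wiki-gigaword-50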
Usage

For example, we want to see whether Fruit and Vegetables matches any of the sentences or items in documents.

Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)
As a result, we can see that the first item, I have an apple in my basket, has a semantic relation to Fruit and Vegetables, so its score will be 0.189, while the second item has no relation, so its score will be 0.

Output:
0.189 # I have an apple in my basket
0.000 # I have a car in my house
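To turn these raw scores into a yes/no decision, a minimal follow-up sketch could filter on a threshold; the 0.1 cutoff below is an assumption chosen for illustration, not a value from the answer:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']

scores = MatchSemantic(query_string, documents)

THRESHOLD = 0.1  # hypothetical cutoff for this example

# Keep only the documents whose soft cosine score clears the cutoff.
matches = [doc for doc, score in zip(documents, scores) if score > THRESHOLD]
print(matches)  # ['I have an apple in my basket']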