Combining TF-IDF with pre-trained Word embeddings
I have a list of website meta descriptions (128k descriptions, roughly 20-30 words each on average) and I am trying to build a similarity ranker (as in: show me the 5 website meta descriptions most similar to this one).
It works quite well with TF-IDF uni- and bigrams, and I figured I could improve it further by adding pre-trained word embeddings (spacy's "en_core_web_lg", to be exact). Plot twist: it doesn't work at all. Literally not a single good guess; it suddenly spits out completely nonsensical suggestions.
Here is my code. Any ideas where I might have gone wrong? Am I overlooking something really obvious?
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import sys
import pickle
import spacy
import scipy.sparse
from scipy.sparse import csr_matrix
import math
from sklearn.metrics.pairwise import linear_kernel
nlp=spacy.load('en_core_web_lg')
""" Tokenizing"""
def _keep_token(t):
    return (t.is_alpha and
            not (t.is_space or t.is_punct or
                 t.is_stop or t.like_num))

def _lemmatize_doc(doc):
    return [t.lemma_ for t in doc if _keep_token(t)]

def _preprocess(doc_list):
    return [_lemmatize_doc(nlp(doc)) for doc in doc_list]

def dummy_fun(doc):
    return doc
# Importing the list of 128,000 meta descriptions:
Web_data=open("./data/meta_descriptions","r", encoding="utf-8")
All_lines=Web_data.readlines()
# outputs a list of meta-descriptions consisting of lists of preprocessed tokens:
data=_preprocess(All_lines)
# TF-IDF Vectorizer:
vectorizer = TfidfVectorizer(min_df=10,tokenizer=dummy_fun,preprocessor=dummy_fun,)
tfidf = vectorizer.fit_transform(data)
dictionary = vectorizer.get_feature_names()
# Retrieving Word embedding vectors:
temp_array=[nlp(dictionary[i]).vector for i in range(len(dictionary))]
# I had to build the sparse array in several steps due to RAM constraints
# (with bigrams the vocabulary gets as large as >1m entries)
dict_emb_sparse=scipy.sparse.csr_matrix(temp_array[0])
for arr in range(1,len(temp_array),100000):
    print(str(arr))
    dict_emb_sparse=scipy.sparse.vstack([dict_emb_sparse, scipy.sparse.csr_matrix(temp_array[arr:min(arr+100000,len(temp_array))])])
# Multiplying the TF-IDF matrix with the Word embeddings:
tfidf_emb_sparse=tfidf.dot(dict_emb_sparse)
# Translating the Query into the TF-IDF matrix and multiplying with the same Word Embeddings:
query_doc= vectorizer.transform(_preprocess(["World of Books is one of the largest online sellers of second-hand books in the world Our massive collection of over million cheap used books also comes with free delivery in the UK Whether it s the latest book release fiction or non-fiction we have what you are looking for"]))
query_emb_sparse=query_doc.dot(dict_emb_sparse)
# Calculating Cosine Similarities:
cosine_similarities = linear_kernel(query_emb_sparse, tfidf_emb_sparse).flatten()
related_docs_indices = cosine_similarities.argsort()[:-10:-1]
# Printing the Site descriptions with the highest match:
for ID in related_docs_indices:
    print(All_lines[ID])
I stole parts of the code/logic from this GitHub repo.
Does anyone see any immediate mistakes here?
Thanks a lot!!
You should try training embeddings on your own corpus. There are plenty of packages for that: gensim, GloVe.
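For illustration, a minimal sketch of training word vectors on your own descriptions with gensim (4.x) could look like the following; the parameter values are assumptions for the example, not tuned recommendations:

from gensim.models import Word2Vec
# `data` is the list of token lists produced by _preprocess() above.
# vector_size/window/min_count are illustrative values, not tuned ones.
w2v = Word2Vec(
    sentences=data,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=10,     # drop rare tokens, mirroring min_df=10 above
    workers=4,
)
vec = w2v.wv["book"]  # embedding of a token learned from your own corpus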
Alternatively, you can use embeddings from BERT without retraining anything on your own corpus.
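As a hedged sketch (the sentence-transformers package and the model name are my own choices for illustration, not something this answer prescribes), you could embed whole descriptions with a pre-trained BERT-style model and rank by cosine similarity directly, with no TF-IDF step:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any pre-trained sentence encoder works
doc_emb = model.encode(All_lines)                # shape: (n_docs, dim); raw text, no preprocessing needed
query_text = "World of Books is one of the largest online sellers of second-hand books ..."  # same query as above, truncated here
query_emb = model.encode([query_text])
sims = cosine_similarity(query_emb, doc_emb).flatten()
related_docs_indices = sims.argsort()[:-10:-1]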
You should be aware that the probability distributions of different corpora always differ. For example, the count of 'basketball' in posts about food is very different from its count in sports news, so the word embeddings of 'basketball' learned from those corpora end up far apart.
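If you want to see this effect concretely, you could compare the neighbours of the same word in two publicly available models trained on different corpora via gensim's downloader (the two model names below are simply ones that happen to be available there, chosen for illustration):

import gensim.downloader as api
# Two pre-trained GloVe models built from different corpora (Twitter vs. Wikipedia/Gigaword).
twitter = api.load("glove-twitter-25")
wiki = api.load("glove-wiki-gigaword-50")
# The nearest neighbours of the same word differ noticeably between the two corpora:
print(twitter.most_similar("basketball", topn=5))
print(wiki.most_similar("basketball", topn=5))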