Relate vector of text in dictionary to key in dictionary

I have text that I fetch from a sqlite3 database. I want to compare the similarity of the texts by first getting their vectors with CountVectorizer. I also have a dictionary that stores the text associated with each messageID (the dictionary key). How can I associate each text vector with its messageID? For example, with a vector array that looks like this

    [[1 1 0 1 1 0 1]
     [0 1 1 1 1 0 1]
     [0 1 0 1 1 1 1]]

I would like to know that messageID = 0 has the vector [1 1 0 1 1 0 1]. The vector size and the size of the array grow with every new message.

I have tried passing the dictionary into CountVectorizer and tried evaluating just one message, but neither worked.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosineSimilarity


def getVectorsAndFeatures(strs):
    text = [t for t in strs]
    # no positional argument: CountVectorizer(text) would be taken as the `input` parameter
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    vectors = vectorizer.transform(text).toarray()
    features = vectorizer.get_feature_names()
    return vectors, features


text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
messageDict = {0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

vectors, features = getVectorsAndFeatures(text)

Following your example, you have a mapping between message IDs and sentences:

>>> text = ['This is the first sentence', 'This is the second sentence',
 'This is the third sentence']
>>> message_map = dict(zip(range(len(text)), text))
>>> message_map
{0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

Then you want to use CountVectorizer to count how many times each text feature occurs in each sentence. You can run the same analysis you did:

>>> vectorizer = CountVectorizer()
>>> # Learn the vocabulary dictionary and return the term-document matrix
>>> vectors = vectorizer.fit_transform(message_map.values()).toarray()
>>> vectors
array([[1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 1]], dtype=int64)
>>> # get a mapping of the feature associated with each count entry
>>> features = vectorizer.get_feature_names()
>>> features
['first', 'is', 'second', 'sentence', 'the', 'third', 'this']

In the fit_transform() documentation you have:

fit_transform(self, raw_documents, y=None)

Parameters: raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns: X : array, [n_samples, n_features]

Document-term matrix.

This means that each vector corresponds, in the same order, to a sentence in the input text (that is, message_map.values()). If you want to map an ID to each vector, you can do the following (note that the order is preserved):

>>> vector_map = dict(zip(message_map.keys(), vectors.tolist()))
>>> vector_map
{0: [1, 1, 0, 1, 1, 0, 1], 1: [0, 1, 1, 1, 1, 0, 1], 2: [0, 1, 0, 1, 1, 1, 1]}
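Since the question already imports cosine_similarity, here is a short sketch (assuming the same three example messages) of how the ID-to-vector mapping can feed a similarity check between two messages looked up by their IDs:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

message_map = {0: 'This is the first sentence',
               1: 'This is the second sentence',
               2: 'This is the third sentence'}

# Build the ID -> count-vector mapping exactly as above
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(message_map.values()).toarray()
vector_map = dict(zip(message_map.keys(), vectors.tolist()))

# Compare message 0 with message 1; cosine_similarity expects 2D inputs
sim = cosine_similarity([vector_map[0]], [vector_map[1]])[0][0]
print(sim)  # 0.8 -- the two sentences share 4 of their 5 words
```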

I believe what you are asking is how to fit a corpus and then transform new sentences into vectors of feature counts. But note that any new word that is not in the original corpus will be ignored, as this example shows:

from sklearn.feature_extraction.text import CountVectorizer

corpus= ['This is the first sentence', 'This is the second sentence']
vectorizer = CountVectorizer() 
vectorizer.fit(corpus)

message_map = {0:'This is the first sentence', 1:'This is the second sentence', 2:'This is the third sentence'}

vector_map = { k: vectorizer.transform([v]).toarray().tolist()[0] for k, v in message_map.items()}

You get:

>>> vector_map
{0: [1, 1, 0, 1, 1, 1], 1: [0, 1, 1, 1, 1, 1], 2: [0, 1, 0, 1, 1, 1]}

Note that you now have one feature fewer than before, because the word third is no longer part of the features.

>>> vectorizer.get_feature_names()
['first', 'is', 'second', 'sentence', 'the', 'this']

This can be a bit of a problem when computing similarity between vectors, because you are throwing away words that could be used to distinguish them.
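One way around this, consistent with the question's observation that the vocabulary grows with every new message, is simply to refit the vectorizer on the full message set whenever a message arrives. A minimal sketch (the helper name `rebuild_vector_map` is mine, not from any library):

```python
from sklearn.feature_extraction.text import CountVectorizer

def rebuild_vector_map(message_map):
    """Refit on all messages so no word gets dropped from the vocabulary."""
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(message_map.values()).toarray()
    return dict(zip(message_map.keys(), vectors.tolist()))

message_map = {0: 'This is the first sentence',
               1: 'This is the second sentence'}

# A new message arrives: refit so that 'third' becomes a feature too
message_map[2] = 'This is the third sentence'
vector_map = rebuild_vector_map(message_map)
print(vector_map[2])  # [0, 1, 0, 1, 1, 1, 1] -- 7 features again
```

Refitting is O(total corpus size) per new message, which is fine for small message stores but worth keeping in mind at scale.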

Alternatively, you could use an English dictionary, or a subset of one, as the corpus for the vectorizer. However, the resulting vectors would become much sparser, which could again cause problems when comparing them. That said, it would depend on the method you use to compute the distance between the vectors.
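If the vocabulary does get large, one mitigation (a sketch, not part of the original answer) is to skip .toarray() entirely: fit_transform returns a scipy sparse matrix, and cosine_similarity accepts sparse input directly, so pairwise similarities can be computed without materializing dense vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['This is the first sentence', 'This is the second sentence',
          'This is the third sentence']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix; no .toarray()

# Pairwise cosine similarity of every message against every other
sims = cosine_similarity(X)
print(sims.shape)  # (3, 3); row i holds message i's similarity to each message
```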