Relate vector of text in dictionary to key in dictionary

I have text that I fetch from a sqlite3 database. I want to compare the similarity of the texts by first getting their vectors with CountVectorizer. I also have a dictionary that stores the text associated with each messageID (the dictionary key). How can I associate each text vector with its messageID? For example, with a vector array that looks like this

    [[1 1 0 1 1 0 1]
     [0 1 1 1 1 0 1]
     [0 1 0 1 1 1 1]]

I would like to know that messageID = 0 has the vector [1 1 0 1 1 0 1]. The vector size and the size of the array grow with every new message.

I have tried passing the dictionary into CountVectorizer and tried evaluating just one message, but neither worked.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosineSimilarity


def getVectorsAndFeatures(strs):
    text = [t for t in strs]
    # no positional argument: CountVectorizer(text) would be taken as the `input` parameter
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    vectors = vectorizer.transform(text).toarray()
    features = vectorizer.get_feature_names()
    return vectors, features


text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
messageDict = {0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

vectors, features = getVectorsAndFeatures(text)

Following your example, you have a mapping between message IDs and sentences:

>>> text = ['This is the first sentence', 'This is the second sentence',
 'This is the third sentence']
>>> message_map = dict(zip(range(len(text)), text))
>>> message_map
{0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}

Then you want to use CountVectorizer to count how many times each text feature occurs in each sentence. You can run the same analysis you did:

>>> vectorizer = CountVectorizer()
>>> # Learn the vocabulary dictionary and return the term-document matrix
>>> vectors = vectorizer.fit_transform(message_map.values()).toarray()
>>> vectors
array([[1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 1]], dtype=int64)
>>> # get a mapping of the feature associated with each count entry
>>> features = vectorizer.get_feature_names()
>>> features
['first', 'is', 'second', 'sentence', 'the', 'third', 'this']

In the fit_transform() documentation you have:

fit_transform(self, raw_documents, y=None)

Parameters: raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns: X : array, [n_samples, n_features]

Document-term matrix.

This means that each vector corresponds, in the same order, to a sentence in the input text (that is, message_map.values()). If you want to map an ID to each vector, you can do the following (note that the order is preserved):

>>> vector_map = dict(zip(message_map.keys(), vectors.tolist()))
>>> vector_map
{0: [1, 1, 0, 1, 1, 0, 1], 1: [0, 1, 1, 1, 1, 0, 1], 2: [0, 1, 0, 1, 1, 1, 1]}
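Since the question already imports cosine_similarity, here is a short sketch (assuming the same three example messages) of how the ID-to-vector mapping can feed a similarity check between two messages looked up by their IDs:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

message_map = {0: 'This is the first sentence',
               1: 'This is the second sentence',
               2: 'This is the third sentence'}

# Build the ID -> count-vector mapping exactly as above
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(message_map.values()).toarray()
vector_map = dict(zip(message_map.keys(), vectors.tolist()))

# Compare message 0 with message 1; cosine_similarity expects 2D inputs
sim = cosine_similarity([vector_map[0]], [vector_map[1]])[0][0]
print(sim)  # 0.8 -- the two sentences share 4 of their 5 words
```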

I believe what you are asking is how to fit a corpus and then transform new sentences into vectors of feature counts. But note that any new word that is not in the original corpus will be ignored, as this example shows:

from sklearn.feature_extraction.text import CountVectorizer

corpus= ['This is the first sentence', 'This is the second sentence']
vectorizer = CountVectorizer() 
vectorizer.fit(corpus)

message_map = {0:'This is the first sentence', 1:'This is the second sentence', 2:'This is the third sentence'}

vector_map = { k: vectorizer.transform([v]).toarray().tolist()[0] for k, v in message_map.items()}

You get:

>>> vector_map
{0: [1, 1, 0, 1, 1, 1], 1: [0, 1, 1, 1, 1, 1], 2: [0, 1, 0, 1, 1, 1]}

Note that you now have one feature fewer than before, because the word third is no longer part of the features.

>>> vectorizer.get_feature_names()
['first', 'is', 'second', 'sentence', 'the', 'this']

This can be a bit of a problem when computing similarity between vectors, because you are throwing away words that could be used to distinguish them.
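One way around this, consistent with the question's observation that the vocabulary grows with every new message, is simply to refit the vectorizer on the full message set whenever a message arrives. A minimal sketch (the helper name `rebuild_vector_map` is mine, not from any library):

```python
from sklearn.feature_extraction.text import CountVectorizer

def rebuild_vector_map(message_map):
    """Refit on all messages so no word gets dropped from the vocabulary."""
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(message_map.values()).toarray()
    return dict(zip(message_map.keys(), vectors.tolist()))

message_map = {0: 'This is the first sentence',
               1: 'This is the second sentence'}

# A new message arrives: refit so that 'third' becomes a feature too
message_map[2] = 'This is the third sentence'
vector_map = rebuild_vector_map(message_map)
print(vector_map[2])  # [0, 1, 0, 1, 1, 1, 1] -- 7 features again
```

Refitting is O(total corpus size) per new message, which is fine for small message stores but worth keeping in mind at scale.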

Alternatively, you could use an English dictionary, or a subset of one, as the corpus for the vectorizer. However, the resulting vectors would become much sparser, which could again cause problems when comparing them. That said, it would depend on the method you use to compute the distance between the vectors.
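If the vocabulary does get large, one mitigation (a sketch, not part of the original answer) is to skip .toarray() entirely: fit_transform returns a scipy sparse matrix, and cosine_similarity accepts sparse input directly, so pairwise similarities can be computed without materializing dense vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['This is the first sentence', 'This is the second sentence',
          'This is the third sentence']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix; no .toarray()

# Pairwise cosine similarity of every message against every other
sims = cosine_similarity(X)
print(sims.shape)  # (3, 3); row i holds message i's similarity to each message
```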