Relate vector of text in dictionary to key in dictionary
I have texts fetched from a sqlite3 database. I want to compare their similarity by first obtaining a vector for each text with CountVectorizer. I also have a dictionary that stores the texts, keyed by messageID. How can I associate each text vector with its messageID? For example, given a vector array that looks like this
[[1 1 0 1 1 0 1]
 [0 1 1 1 1 0 1]
 [0 1 0 1 1 1 1]]
I want to know that messageID = 0 has the vector [1 1 0 1 1 0 1]. The vector size and the size of the array grow with each new message.
I tried putting the dictionary into CountVectorizer and also tried evaluating just one message, but neither worked.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as cosineSimilarity

def getVectorsAndFeatures(strs):
    text = [t for t in strs]
    # note: CountVectorizer(text) would set the `input` parameter, not the corpus
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    vectors = vectorizer.transform(text).toarray()
    features = vectorizer.get_feature_names()
    return vectors, features
text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
messageDict = {0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}
vectors, features = getVectorsAndFeatures(text)
Following your example, you have a mapping between message IDs and sentences:
>>> text = ['This is the first sentence', 'This is the second sentence',
... 'This is the third sentence']
>>> message_map = dict(zip(range(len(text)), text))
>>> message_map
{0: 'This is the first sentence', 1: 'This is the second sentence', 2: 'This is the third sentence'}
Then you want to use CountVectorizer to count how many times each text feature appears in each sentence. You can run the same analysis you did:
>>> vectorizer = CountVectorizer()
>>> # Learn the vocabulary dictionary and return the term-document matrix
>>> vectors = vectorizer.fit_transform(message_map.values()).toarray()
>>> vectors
array([[1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 1]], dtype=int64)
>>> # get a mapping of the feature associated with each count entry
>>> features = vectorizer.get_feature_names()
>>> features
['first', 'is', 'second', 'sentence', 'the', 'third', 'this']
In the fit_transform() documentation you have:
fit_transform(self, raw_documents, y=None)
    Parameters: raw_documents : iterable
        An iterable which yields either str, unicode or file objects.
    Returns: X : array, [n_samples, n_features]
        Document-term matrix.
This means that each vector corresponds, in the same order, to one sentence of the input text (that is, message_map.values()). If you want to map the IDs to each vector, you can do the following (note that the order is preserved):
>>> vector_map = dict(zip(message_map.keys(), vectors.tolist()))
>>> vector_map
{0: [1, 1, 0, 1, 1, 0, 1], 1: [0, 1, 1, 1, 1, 0, 1], 2: [0, 1, 0, 1, 1, 1, 1]}
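Since cosine_similarity is already imported in the question but never used, here is a minimal sketch of how the ID-to-vector mapping lets you compare two messages by their IDs (same sentences as above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = ['This is the first sentence', 'This is the second sentence',
        'This is the third sentence']
message_map = dict(zip(range(len(text)), text))

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(message_map.values()).toarray()
vector_map = dict(zip(message_map.keys(), vectors.tolist()))

# look the two messages up by ID and compare their count vectors
similarity = cosine_similarity([vector_map[0]], [vector_map[1]])[0][0]
```

For these two sentences the vectors share four of five non-zero entries, so the similarity comes out to 0.8.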
I believe what you are asking is how to fit a corpus and then transform new sentences into feature-count vectors. Note, however, that any new words that were not in the original corpus will be ignored, as this example shows:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first sentence', 'This is the second sentence']
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
message_map = {0:'This is the first sentence', 1:'This is the second sentence', 2:'This is the third sentence'}
vector_map = { k: vectorizer.transform([v]).toarray().tolist()[0] for k, v in message_map.items()}
You get:
>>> vector_map
{0: [1, 1, 0, 1, 1, 1], 1: [0, 1, 1, 1, 1, 1], 2: [0, 1, 0, 1, 1, 1]}
Note that you now have one fewer feature than before, because the word third is no longer part of the features.
>>> vectorizer.get_feature_names()
['first', 'is', 'second', 'sentence', 'the', 'this']
This can be somewhat problematic when computing the similarity between vectors, because you are throwing away words that could be used to distinguish them.
On the other hand, you could use an English dictionary, or a subset of it, as the corpus and fit the vectorizer on that. However, the resulting vectors would become much sparser, which may again cause problems when comparing them. But that will depend on the method you use to compute the distance between vectors.
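Alternatively, since the question says the array grows with each new message, a simple approach is to refit the vectorizer on the whole message dictionary whenever a message arrives, so no words are dropped. A sketch (the helper name is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

def rebuild_vector_map(message_map):
    # refit on the full dictionary so words from new messages become features too
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(message_map.values()).toarray()
    return dict(zip(message_map.keys(), vectors.tolist()))

message_map = {0: 'This is the first sentence', 1: 'This is the second sentence'}
vector_map = rebuild_vector_map(message_map)

# a new message arrives: refit so 'third' is not silently dropped
message_map[2] = 'This is the third sentence'
vector_map = rebuild_vector_map(message_map)
```

This trades recomputation cost for keeping every word as a feature; all vectors grow together, matching the behavior described in the question.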