Get unmatched words after CountVectorizer transform

Question

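I am using CountVectorizer to do string matching over a large text dataset. What I want is to get the words that do not match any term in the resulting matrix. For example, if the resulting terms (features) after fitting are: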

{'hello world', 'world and', 'and Whosebug', 'hello', 'world', 'Whosebug', 'and'}

and I run transform on this text:

"oh hello world and Whosebug this is a great morning"

I want to get the string "oh this is a great morning", since it matches nothing in the features. Is there an efficient way to do this?

I tried using the inverse_transform method to get the features and remove them from the text, but I ran into many problems and it took a very long time to run.

Answer 1

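Transforming a piece of text with a fitted vocabulary returns a matrix of counts for the known vocabulary only.

For example, if your input document is as in your example: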

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(1, 2))

docs = ['hello world and Whosebug']
vec.fit(docs)

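then the fitted vocabulary looks like this (CountVectorizer lowercases its input by default, so 'Whosebug' is stored as 'whosebug'):

In [522]: print(vec.vocabulary_)
{'hello': 2,
 'world': 5,
 'and': 0,
 'whosebug': 4,
 'hello world': 3,
 'world and': 6,
 'and whosebug': 1}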

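This is a mapping from token to column index. Transforming new documents then returns a matrix holding the counts of all known vocabulary tokens. Words that are not in the vocabulary are ignored!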

other_docs = ['hello Whosebug', 
              'hello and hello', 
              'oh hello world and Whosebug this is a great morning']

X = vec.transform(other_docs)

In [523]: print(X.toarray())
[[0 0 1 0 1 0 0]
 [1 0 2 0 0 0 0]
 [1 1 1 1 1 1 1]]

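Your vocabulary contains 7 items, so the matrix X has 7 columns. We transformed 3 documents, so it is a 3x7 matrix. Its elements are counts, i.e. how often a particular term occurs in a document. For example, for the second document "hello and hello" there is a count of 2 in column 2 (0-indexed) and a count of 1 in column 0, which refer to "hello" and "and" respectively.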
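
To make the counting concrete, here is a minimal pure-Python sketch of how one row of such a matrix could be computed. It is not scikit-learn's implementation: `count_row` is a hypothetical helper, the vocabulary is typed in by hand with lowercased keys (as CountVectorizer produces by default), and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter

# Hand-written copy of the fitted vocabulary: term -> column index.
vocabulary = {'hello': 2, 'world': 5, 'and': 0, 'whosebug': 4,
              'hello world': 3, 'world and': 6, 'and whosebug': 1}

def count_row(doc, vocabulary):
    """Count known unigrams and bigrams in doc; unknown terms are skipped."""
    tokens = doc.lower().split()
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(tokens + bigrams)
    row = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        row[index] = counts[term]  # Counter returns 0 for absent terms
    return row

print(count_row('hello and hello', vocabulary))  # [1, 0, 2, 0, 0, 0, 0]
```

The bigrams 'hello and' and 'and hello' are generated but never counted, because they have no column in the vocabulary.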

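The inverse transform maps a document-term matrix back to vocabulary terms: for each row, it returns the terms whose entries are nonzero:

In [534]: print(vec.inverse_transform([[1, 2, 3, 4, 5, 6, 7]]))
[array(['and', 'and whosebug', 'hello', 'hello world',
   'whosebug', 'world', 'world and'], dtype='<U12')]

Note: the values passed in are counts for a single document, not indices. Every column is nonzero here, so all seven terms are returned, in the order of the vocabulary indices printed above.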
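
In practice, inverse_transform is usually given the output of transform, which yields the matched terms for each document. The following is a sketch that assumes scikit-learn is installed and reuses two of the documents from above; `matched` is a name introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['hello world and Whosebug'])

other_docs = ['hello Whosebug', 'hello and hello']
X = vec.transform(other_docs)

# inverse_transform returns one array of terms per row of X,
# containing the terms whose count in that row is nonzero.
matched = [set(terms) for terms in vec.inverse_transform(X)]
for doc, terms in zip(other_docs, matched):
    print(doc, '->', sorted(terms))
```

This tells you which vocabulary terms occurred in each document, but not which input words were dropped, which is why it does not solve the question on its own.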

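Now to your actual problem: identifying all out-of-vocabulary (OOV) items in a given input document. If you are only interested in unigrams, this is straightforward with sets. The input is lowercased first, because the vocabulary keys are lowercase:

tokens = 'oh hello world and Whosebug this is a great morning'.lower().split()
In [542]: print(set(tokens) - set(vec.vocabulary_.keys()))
{'morning', 'a', 'is', 'this', 'oh', 'great'}

If you are also interested in bigrams (or any other n-gram with n > 1), things get slightly more complicated, because you first need to generate all bigrams from the input document (there are various ways to generate n-grams from an input document; the following is only one of them):

bigrams = list(map(lambda x: ' '.join(x), zip(tokens, tokens[1:])))
In [546]: print(bigrams)
['oh hello', 'hello world', 'world and', 'and whosebug', 'whosebug this', 'this is', 'is a', 'a great', 'great morning']

This line looks fancy, but all it does is zip two lists together (the second list starts at the second item), which yields tuples such as ('oh', 'hello'). The map call joins each tuple with a single space, turning ('oh', 'hello') into 'oh hello', and the resulting map iterator is then converted to a list. Now you can build the union of unigrams and bigrams:

doc_vocab = set(tokens) | set(bigrams)
In [549]: print(doc_vocab)
{'and whosebug', 'hello', 'a', 'morning', 'hello world', 'great morning', 'world', 'whosebug', 'whosebug this', 'is', 'world and', 'oh hello', 'oh', 'this', 'is a', 'this is', 'and', 'a great', 'great'}

Now you can use the same approach as for the unigrams above to retrieve all OOV items:

In [550]: print(doc_vocab - set(vec.vocabulary_.keys()))
{'morning', 'a', 'great morning', 'whosebug this', 'is a', 'is', 'oh hello', 'this', 'this is', 'oh', 'a great', 'great'}

This is the set of all unigrams and bigrams that are not in your vectorizer's vocabulary.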
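
Splitting on whitespace only approximates what the vectorizer does. A sketch that may be more robust is to reuse the vectorizer's own analyzer via build_analyzer(), which applies the same lowercasing, tokenization, and n-gram generation as transform. It assumes scikit-learn is installed; `oov_terms`, `known`, and `unmatched` are names introduced here for illustration. Note that the default token pattern drops single-character tokens such as 'a' before n-grams are built, so the result differs slightly from the split-based version.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['hello world and Whosebug'])

# Build these once and reuse them for every document in a large dataset.
analyzer = vec.build_analyzer()    # preprocessing + tokenizing + n-grams
tokenize = vec.build_tokenizer()   # tokenizing only (no lowercasing)
known = set(vec.vocabulary_)

def oov_terms(doc):
    """Return the n-grams of doc that are not in the fitted vocabulary."""
    return set(analyzer(doc)) - known

doc = 'oh hello world and Whosebug this is a great morning'
print(sorted(oov_terms(doc)))

# Unmatched unigrams in their original order, joined back into a string.
unmatched = ' '.join(t for t in tokenize(doc.lower()) if t not in known)
print(unmatched)
```

The second part is closer to what the question asks for: a string of the words that matched nothing, in document order.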

Tags: python, string, python-3.x, scikit-learn, countvectorizer