CountVectorizer but for groups of text

With the code below, CountVectorizer breaks "Air-dried meat" into 3 separate tokens. What I want instead is to keep "Air-dried meat" as 1 token. How can I do that?

Code I run:

from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

Current output:

Our vocabulary:  {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}

Desired output:

Our vocabulary:  {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}

You can use options in CountVectorizer to change this behavior, i.e. token_pattern or tokenizer.


If you use token_pattern='.+'

CountVectorizer(binary=True, token_pattern='.+')

then it treats every element in the list as a single word.

from sklearn.feature_extraction.text import CountVectorizer

food_names = ['Air-dried meat', 'Almonds', 'Amaranth']

count_vect = CountVectorizer(binary=True, token_pattern='.+')
bow_rep = count_vect.fit(food_names)

print("Our vocabulary:", count_vect.vocabulary_)

Result:

Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
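A quick sketch of the catch with this approach (assuming the same food_names as above): at transform() time the whole lowercased input string has to match a vocabulary entry exactly, so anything else maps to all zeros.

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, token_pattern='.+')
count_vect.fit(food_names)

# '.+' matches the whole (lowercased) string, so only exact
# vocabulary entries are recognized:
print(count_vect.transform(['Air-dried meat']).toarray())     # [[1 0 0]]
# Any other string is one unknown token and counts nothing:
print(count_vect.transform(['Almonds of Germany']).toarray())  # [[0 0 0]]
```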

If you use tokenizer=shlex.split

CountVectorizer(binary=True, tokenizer=shlex.split)

then you can use quotes (" ") to group words inside a string.

from sklearn.feature_extraction.text import CountVectorizer
import shlex

food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)

print("Our vocabulary:", count_vect.vocabulary_)

Result:

Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}

BTW: similar question on the DataScience portal:

how to avoid tokenizing w/ sklearn feature extraction


EDIT:

You can also convert food_names to lower() and use it as vocabulary

vocabulary = [x.lower() for x in food_names]

count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)

and it will also treat them as single elements in the vocabulary

from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]

count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)

bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
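A short sketch of why this variant alone is not enough: the vocabulary is fixed, but the default token_pattern still splits the input at transform() time, so the multi-word entry can never be matched.

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
count_vect.fit(food_names)

# Default tokenization yields 'air', 'dried', 'meat' - none of them
# is in the vocabulary, so nothing is counted:
print(count_vect.transform(['Air-dried meat']).toarray())  # [[0 0 0]]
# Single-word entries still match:
print(count_vect.transform(['Almonds']).toarray())          # [[0 1 0]]
```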

The problem comes when you want to use these approaches with transform(): only tokenizer=shlex.split also splits the text being transformed, but it still needs quotes (" ") in that text to catch Air-dried meat

from sklearn.feature_extraction.text import CountVectorizer
import shlex

food_names = ['"Air-dried meat" Almonds Amaranth']

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)

text = 'Almonds of Germany'
temp = count_vect.transform([text])
print(text, temp.toarray())

text = '"Air-dried meat"'
temp = count_vect.transform([text])
print(text, temp.toarray())

text = 'Air-dried meat'
temp = count_vect.transform([text])
print(text, temp.toarray())