CountVectorizer but for groups of text
With the code below, CountVectorizer breaks "Air-dried meat" into 3 different vectors. But what I want is to keep "Air-dried meat" as 1 vector. How can I do that?
The code I run:
from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)
Current output:
Our vocabulary: {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}
Desired output:
Our vocabulary: {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}
You can change this behaviour with options in CountVectorizer, namely token_pattern or tokenizer.
If you use token_pattern='.+'
CountVectorizer(binary=True, token_pattern='.+')
then it treats every element in the list as a single word.
from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, token_pattern='.+')
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
Result:
Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
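One caveat worth noting (a sketch, assuming the same food_names as above): with token_pattern='.+' the whole document is also treated as one token at transform() time, so only strings that exactly equal a fitted (lowercased) entry are counted:

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, token_pattern='.+')
count_vect.fit(food_names)

# An exact (lowercased) match of a whole fitted entry is found ...
print(count_vect.transform(['Almonds']).toarray())           # [[0 1 0]]
# ... but any other string becomes one unknown token and matches nothing.
print(count_vect.transform(['Almonds Amaranth']).toarray())  # [[0 0 0]]
```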
If you use tokenizer=shlex.split
CountVectorizer(binary=True, tokenizer=shlex.split)
then you can use " " to group words inside a string:
from sklearn.feature_extraction.text import CountVectorizer
import shlex
food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
Result:
Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}
BTW: a similar question on the Data Science portal: how to avoid tokenizing w/ sklearn feature extraction
EDIT:
You can also convert food_names with lower() and use it as the vocabulary
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
and it will also treat each entry as a single element of the vocabulary:
from sklearn.feature_extraction.text import CountVectorizer
food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
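Note, however, that passing vocabulary only registers the entries; at transform() time the default analyzer still splits on word boundaries, so the multi-word entry is never produced as a token (a sketch, same names as above):

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
count_vect.fit(food_names)

# Default tokenization yields ['air', 'dried', 'meat'] - none of these
# is in the vocabulary, so the multi-word entry is never matched.
print(count_vect.transform(['Air-dried meat']).toarray())  # [[0 0 0]]
# Single-word entries still work.
print(count_vect.transform(['Almonds']).toarray())         # [[0 1 0]]
```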
The problem comes when you want to use these methods with transform(), because only tokenizer=shlex.split splits the text being transformed. But it also needs " " in the text to catch Air-dried meat:
from sklearn.feature_extraction.text import CountVectorizer
import shlex
food_names = ['"Air-dried meat" Almonds Amaranth']
count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
text = 'Almonds of Germany'
temp = count_vect.transform([text])
print(text, temp.toarray())   # [[0 1 0]] - 'almonds' is caught
text = '"Air-dried meat"'
temp = count_vect.transform([text])
print(text, temp.toarray())   # [[1 0 0]] - the quoted phrase is caught
text = 'Air-dried meat'
temp = count_vect.transform([text])
print(text, temp.toarray())   # [[0 0 0]] - without quotes the phrase is missed
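If you cannot add " " quotes to the incoming text, another option (my sketch, not part of the approaches above - PHRASES and phrase_tokenizer are made-up names) is a custom tokenizer that extracts known phrases first and splits the rest on whitespace:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Assumed list of multi-word terms you want to keep together.
PHRASES = ['air-dried meat']
PHRASE_RE = re.compile('|'.join(re.escape(p) for p in PHRASES))

def phrase_tokenizer(doc):
    # doc arrives already lowercased by CountVectorizer's preprocessor
    tokens = []
    pos = 0
    for m in PHRASE_RE.finditer(doc):
        tokens.extend(doc[pos:m.start()].split())  # words before the phrase
        tokens.append(m.group())                   # the phrase as one token
        pos = m.end()
    tokens.extend(doc[pos:].split())               # words after the last phrase
    return tokens

count_vect = CountVectorizer(binary=True, tokenizer=phrase_tokenizer)
count_vect.fit(['Air-dried meat Almonds Amaranth'])
print(count_vect.vocabulary_)
# No quotes needed in the transformed text:
print(count_vect.transform(['Air-dried meat']).toarray())  # [[1 0 0]]
```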