Why doesn't feature extraction of text return all possible feature names?
This is a code snippet from the book Natural Language Processing with PyTorch:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']
one_hot_vectorizer = CountVectorizer()
# The vectorizer has to be fitted on the corpus before a vocabulary exists
one_hot = one_hot_vectorizer.fit_transform(corpus)
vocab = one_hot_vectorizer.get_feature_names()
The value of vocab:
vocab = ['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']
Why is there no 'a' among the extracted feature names? If it was automatically excluded for being too common a word, why isn't "an" excluded for the same reason? And how can I get .get_feature_names() to filter out other words as well?
Good question! Although this isn't really a pytorch question but an sklearn one =)
I'd encourage you to first work through https://www.kaggle.com/alvations/basic-nlp-with-nltk, especially the "Vectorization with sklearn" section.
TL;DR
If we use CountVectorizer,
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer
sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."
with StringIO('\n'.join([sent1, sent2])) as fin:
    # Create the vectorizer
    count_vect = CountVectorizer()
    count_vect.fit_transform(fin)

# We can check the vocabulary in our vectorizer
# It's a dictionary where the words are the keys and
# the values are the IDs given to each word.
print(count_vect.vocabulary_)
[Output]:
{'brown': 0,
'dog': 1,
'fox': 2,
'jumps': 3,
'lazy': 4,
'mr': 5,
'over': 6,
'quick': 7,
'the': 8}
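As a side note, the counts that fit_transform computed line up with those IDs. A quick check, reusing count_vect from above (the expected array is worked out by hand here, so treat it as illustrative):

# Columns follow the vocabulary IDs above:
# brown, dog, fox, jumps, lazy, mr, over, quick, the
print(count_vect.transform([sent1, sent2]).toarray())
# [[2 1 1 1 1 0 1 1 2]
#  [1 0 1 1 1 1 1 0 1]]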
We never told the vectorizer to strip punctuation, tokenize, and lowercase the text, so how did it do all of that?
Also, the is in the vocabulary; it's a stopword and we'd like it gone...
And jumps isn't stemmed or lemmatized!
If we look at the documentation for CountVectorizer in sklearn, we see:
CountVectorizer(
    input='content', encoding='utf-8',
    decode_error='strict', strip_accents=None,
    lowercase=True, preprocessor=None,
    tokenizer=None, stop_words=None,
    token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1),
    analyzer='word', max_df=1.0, min_df=1,
    max_features=None, vocabulary=None,
    binary=False, dtype=<class 'numpy.int64'>)
More specifically:
analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
Whether the feature should be made of word or character n-grams.
Option ‘char_wb’ creates character n-grams only from text inside word
boundaries; n-grams at the edges of words are padded with space. If a
callable is passed it is used to extract the sequence of features out
of the raw, unprocessed input.
preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while
preserving the tokenizing and n-grams generation steps.
tokenizer : callable or None (default)
Override the string tokenization step while preserving the
preprocessing and n-grams generation steps. Only applies if analyzer
== 'word'.
stop_words : string {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used. If a
list, that list is assumed to contain stop words, all of which will be
removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used.
lowercase : boolean, True by default
Convert all characters to lowercase before tokenizing.
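All of those stages are hooks we can override ourselves. As a rough sketch (the callables below are just toy choices, not anything from the book or the sklearn docs), keeping the original casing and splitting on whitespace only would look like:

>>> cv = CountVectorizer(preprocessor=lambda s: s,  # identity preprocessor: skips the default lowercasing
...                      tokenizer=str.split)       # naive whitespace split: punctuation stays attached to tokens
>>> cv.fit(corpus)
>>> cv.get_feature_names()   # should now keep case, trailing punctuation and single characters, roughly:
['Fruit', 'Time', 'a', 'an', 'arrow.', 'banana.', 'flies', 'like']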
But for the example from http://shop.oreilly.com/product/0636920063445.do, it isn't the stop words that are causing the problem.
If we explicitly use the English stop words from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> one_hot_vectorizer = CountVectorizer(stop_words='english')
>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words='english',
strip_accents=None, token_pattern='(?u)\b\w\w+\b',
tokenizer=None, vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()
['arrow', 'banana', 'flies', 'fruit', 'like', 'time']
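That also answers the last part of the question: stop_words accepts a plain list, so filtering any other words is just a matter of passing your own. A small sketch with an arbitrary word list (the words here are just examples):

>>> my_stops = ['a', 'an', 'like', 'flies']          # any words you want removed
>>> one_hot_vectorizer = CountVectorizer(stop_words=my_stops)
>>> one_hot_vectorizer.fit(corpus)
>>> one_hot_vectorizer.get_feature_names()
['arrow', 'banana', 'fruit', 'time']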
So what exactly is happening when the stop_words argument is left as None?
Let's do an experiment and add some single-character words to the input:
>>> corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']
>>> one_hot_vectorizer = CountVectorizer()
>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\b\w\w+\b',
tokenizer=None, vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()
['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']
They're gone again!!!
Now, if we dig into the docs, https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L738:
token_pattern : string
Regular expression denoting what constitutes a "token", only used
if analyzer == 'word'. The default regexp select tokens of 2 or more
alphanumeric characters (punctuation is completely ignored and always
treated as a token separator).
Ah ha, so that's why all the single-character tokens got removed!
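We can watch that pattern in action with plain re (just a sanity check on the regexp, not part of the original answer):

>>> import re
>>> re.findall(r"(?u)\b\w\w+\b", 'Fruit flies like a banana x y z.')
['Fruit', 'flies', 'like', 'banana']

Every \w\w+ match needs at least two word characters, so 'a', 'x', 'y', 'z' (and the digits 1, 2, 3 in the first sentence) never make it out of the tokenizer.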
The default pattern of CountVectorizer is token_pattern=r"(?u)\b\w\w+\b"; to make it accept single characters, you can try:
>>> one_hot_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\b\w+\b', tokenizer=None,
vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()
['1', '2', '3', 'a', 'an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time', 'x', 'y', 'z']
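Lastly, if you actually wanted words dropped for being "too common" (which, as we saw, is not what happened to 'a'), the max_df parameter from the signature above is the knob for that. A rough sketch on the current corpus, assuming the default token_pattern:

>>> one_hot_vectorizer = CountVectorizer(max_df=0.5)   # ignore tokens that appear in more than 50% of the documents
>>> one_hot_vectorizer.fit(corpus)
>>> one_hot_vectorizer.get_feature_names()   # 'flies' and 'like' occur in both sentences, so they should drop out
['an', 'arrow', 'banana', 'fruit', 'time']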