"word boundaries" 如何在 Python sklearn CountVectorizer 的分析器参数中识别?
How are "word boundaries" identified in Python sklearn CountVectorizer's analyzer parameter?
Python sklearn CountVectorizer has an "analyzer" parameter with a "char_wb" option. According to the documentation:

"Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space."

My question is: how does CountVectorizer identify a "word" in a string? More specifically, are "words" simply the whitespace-separated strings in a sentence, or are they identified by a more sophisticated technique such as nltk's word_tokenize?

The reason I ask is that I am analyzing social media data, which contains a lot of @mentions and #hashtags. Currently, nltk's word_tokenize breaks "@mention" into ["@", "mention"] and "#hashtag" into ["#", "hashtag"]. If I feed these into CountVectorizer with ngram_range > 1, "#" and "@" will never be captured as features. Furthermore, I would like character n-grams (char_wb) to capture "@m" and "#h" as features, which can never happen if CountVectorizer breaks @mentions and #hashtags into ["@", "mention"] and ["#", "hashtag"].

What should I do?
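For reference, here is a minimal sketch of the tokenization behaviour I am describing (it assumes nltk is installed and its punkt tokenizer data has been downloaded):

from nltk.tokenize import word_tokenize

# '@' and '#' are split off as separate tokens
print(word_tokenize("@mention"))  # ['@', 'mention']
print(word_tokenize("#hashtag"))  # ['#', 'hashtag']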
As you can see in the source code, it splits words on whitespace:
def _char_wb_ngrams(self, text_document):
    """Whitespace sensitive char-n-gram tokenization.

    Tokenize text_document into a sequence of character n-grams
    operating only inside word boundaries. n-grams at the edges
    of words are padded with space."""
    # normalize white spaces
    text_document = self._white_spaces.sub(" ", text_document)

    min_n, max_n = self.ngram_range
    ngrams = []

    # bind method outside of loop to reduce overhead
    ngrams_append = ngrams.append

    for w in text_document.split():
        w = ' ' + w + ' '
        w_len = len(w)
        for n in range(min_n, max_n + 1):
            offset = 0
            ngrams_append(w[offset:offset + n])
            while offset + n < w_len:
                offset += 1
                ngrams_append(w[offset:offset + n])
            if offset == 0:  # count a short word (w_len < n) only once
                break
    return ngrams
text_document.split() splits on whitespace.
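Because "@mention" and "#hashtag" are kept as whole whitespace-separated tokens before being padded with spaces, char_wb does produce "@m" and "#h" as features. A quick check (a minimal sketch; it assumes a recent scikit-learn where get_feature_names_out is available):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
vec.fit(["@mention #hashtag"])
# the vocabulary contains space-padded bigrams such as ' @', '@m', ' #', '#h'
print(sorted(vec.get_feature_names_out()))

So the whitespace split works in your favour here: '@' and '#' stay attached to their words, and nltk's word_tokenize is not involved at any point.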