sklearn TfidfVectorizer custom ngrams without characters from regex pattern
I want to perform custom n-gram vectorization with sklearn's TfidfVectorizer. The generated n-grams should not contain any of the characters in a given regex pattern. Unfortunately, the custom tokenizer function is completely ignored when analyzer='char' (n-gram mode). See the following example:
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
pattern = re.compile(r'[\.-]')  # split on '.' and on '-'

def tokenize(text):
    return pattern.split(text)
corpus = np.array(['abc.xyz', 'zzz-m.j'])
# word vectorization
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, analyzer='word', stop_words='english')
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# Output -> {'abc': 0, 'xyz': 3, 'zzz': 4, 'm': 2, 'j': 1}
# This is ok!
# ngram vectorization
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, analyzer='char', ngram_range=(2, 2))
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# Output -> {'ab': 3, 'bc': 4, 'c.': 5, '.x': 2, 'xy': 7, 'yz': 8, 'zz': 10, 'z-': 9, '-m': 0, 'm.': 6, '.j': 1}
# This is not ok! I don't want ngrams to include the '.' and '-' chars used for tokenization
What is the best way to do this?
According to the documentation, tokenizer can only be used when analyzer == 'word'. Quoting the docs:
tokenizer (default=None)
Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
One workaround is to delete every token that contains '.' or '-' from the vocabulary. The following code does exactly that:
from copy import copy
for token in copy(tfidf_vectorizer.vocabulary_):
    if re.search(pattern, token):
        del tfidf_vectorizer.vocabulary_[token]
print(tfidf_vectorizer.vocabulary_)
#{'ab': 3, 'bc': 4, 'xy': 7, 'yz': 8, 'zz': 10}
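Note that deleting entries this way leaves gaps in the feature indices and does not touch the already-fitted idf weights. A minimal sketch of one way to finish the workaround, assuming you can afford a refit, is to pass the surviving tokens to a fresh vectorizer via the vocabulary parameter:

# Refit with the filtered vocabulary so indices are contiguous again.
filtered_vocab = sorted(tfidf_vectorizer.vocabulary_)
tfidf_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2),
                                   vocabulary=filtered_vocab)
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# {'ab': 0, 'bc': 1, 'xy': 2, 'yz': 3, 'zz': 4}

The char analyzer still produces n-grams containing '.' and '-' internally, but anything outside the fixed vocabulary is simply dropped.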
I wrote the following solution using nltk:
import re
import numpy as np
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer

pattern = re.compile(r'[\.-]')  # split on '.' and on '-'
corpus = np.array(['abc.xyz', 'zzz-m.j'])

def analyzer(text):
    text = text.lower()
    tokens = pattern.split(text)
    return [''.join(ngram) for token in tokens for ngram in ngrams(token, 2)]
tfidf_vectorizer = TfidfVectorizer(analyzer=analyzer)
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# Output -> {'ab': 0, 'bc': 1, 'xy': 2, 'yz': 3, 'zz': 4}
Not sure if this is the best approach.
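If you'd rather avoid the nltk dependency, the same analyzer can be written with the standard library alone; this is a minimal sketch of that variant, using zip to build the character bigrams:

import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

pattern = re.compile(r'[\.-]')  # split on '.' and on '-'
corpus = np.array(['abc.xyz', 'zzz-m.j'])

def analyzer(text):
    # pair each character with its successor to form the bigrams;
    # tokens shorter than 2 chars simply yield nothing
    tokens = pattern.split(text.lower())
    return [a + b for token in tokens for a, b in zip(token, token[1:])]

tfidf_vectorizer = TfidfVectorizer(analyzer=analyzer)
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# Output -> {'ab': 0, 'bc': 1, 'xy': 2, 'yz': 3, 'zz': 4}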