sklearn feature union
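The objective is to run a multilabel classifier that takes three inputs. Each input is an excerpt from a larger document. The pipeline has a preliminary step that vectorizes each excerpt with tfidf.
x is a list of strings, each of which is an excerpt.
The code below works, but it seems to ignore the second and third elements of the list.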
def grid_search(train_x, train_y):
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB

    # Grid keys follow the <step name>__<parameter> convention
    parms = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__estimator__alpha': (1e-2, 1e-3),
    }
    # stop_words is defined elsewhere (not shown)
    tfidf1 = ('tfidf', TfidfVectorizer(stop_words=stop_words))
    vctrz = tfidf1
    clsfy = ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None)))
    pipeline = Pipeline([vctrz, clsfy])
    gs1 = GridSearchCV(pipeline, parms, cv=2, n_jobs=1, verbose=0)
    gs1.fit(train_x, train_y)
    return gs1.best_estimator_

classifier = grid_search(train_x, y_train)
I tried this, without success:
vctrz = [tfidf1, tfidf1, tfidf1]
I also tried FeatureUnion:
TFALL = [('tf1', TFIDFX1()), ('tf2', TFIDFX2()), ('tf3', TFIDFX3())]
# maybe the () are extraneous but without them I get a self less error
clsfy = ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None)))
ppl = Pipeline([('feats', FeatureUnion(TFALL)), clsfy])
gs1 = GridSearchCV(ppl, parms, cv=2, n_jobs=1, verbose=5)
where TFIDFX1 is constructed as follows:
class TFIDFX1(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def vectorize(self, doc):
        return vect.fit(doc)

    def transform(self, mylist, y=None):
        return self.vectorize(mylist[0])  #would

    def fit(self, df, y=None):
        return self
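For brevity I have omitted the classes TFIDFX2 and TFIDFX3, which look at mylist[1] and mylist[2] respectively but are otherwise identical.
This fails with the following traceback:
TypeError: float() argument must be a string or a number, not 'TfidfVectorizer'
Any help from the SO community would be much appreciated.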
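Even though the three inputs are homogeneous, the tfidf step does not automatically iterate over the array.
Instead, you have to use a FeatureUnion step and combine the three inputs as three separate tfidf sub-steps, as in this example.
Thanks to @Vivek Kumar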
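For illustration, here is a minimal sketch of that layout. It is not the linked example, and it assumes that every sample in x is a sequence of three excerpt strings. ExcerptSelector, make_branch, build_pipeline and the toy data are hypothetical names made up for this sketch. Each FeatureUnion branch first picks one excerpt from every sample and then runs its own TfidfVectorizer. Every branch returns a matrix from transform. The TFIDFX1 class in the question returns vect.fit(doc), which is the vectorizer itself, and that is probably what triggers the TypeError when FeatureUnion tries to stack the results.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class ExcerptSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick the excerpt at position `index` from each sample."""

    def __init__(self, index=0):
        self.index = index

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # Return a list of strings, which is what TfidfVectorizer expects
        return [sample[self.index] for sample in X]


def make_branch(index):
    """One FeatureUnion branch: select an excerpt, then vectorize it."""
    return Pipeline([
        ("select", ExcerptSelector(index)),
        ("tfidf", TfidfVectorizer()),
    ])


def build_pipeline():
    feats = FeatureUnion([
        ("tf1", make_branch(0)),
        ("tf2", make_branch(1)),
        ("tf3", make_branch(2)),
    ])
    clf = OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None))
    return Pipeline([("feats", feats), ("clf", clf)])


# Toy data (assumption): four samples, three excerpts each, two labels
toy_x = [
    ("red apple pie", "sweet fruit dessert", "bake in the oven"),
    ("green apple tart", "sour fruit dessert", "bake and cool"),
    ("fast red car", "loud engine noise", "drive on the road"),
    ("slow green truck", "quiet engine hum", "park by the road"),
]
toy_y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # multilabel indicator matrix

model = build_pipeline().fit(toy_x, toy_y)
features = model.named_steps["feats"].transform(toy_x)  # tf-idf blocks stacked side by side
predictions = model.predict(toy_x)
```

With this nesting, the grid search keys would have to name the full path to each vectorizer, for example 'feats__tf1__tfidf__max_df' in place of 'tfidf__max_df'; 'clf__estimator__alpha' stays as it is.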