How to add another text feature to the current bag-of-words classification in scikit-learn?
This is my input matrix: [image]
My sample code:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, FeatureUnion

X_train, X_test, y_train, y_test = train_test_split(
    data['Extract'],
    data['Expense Account code Description'], random_state=0)

text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf=False)),
                     ('clf', RandomForestClassifier(n_estimators=100,
                                                    max_features='log2',
                                                    criterion='entropy')),
                     ])
text_clf = text_clf.fit(X_train, y_train)
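Here I am applying a bag-of-words model to the 'Extract' column to classify 'Expense Account code Description', and I get about 92% accuracy. If I want to include 'Vendor name' as another input feature, how can I do that? Is there a way to use it together with the bag of words?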
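You can use FeatureUnion.
You also need to create a new transformer class that performs the operations you need, i.e. takes the vendor name and turns it into dummy (one-hot) variables.
FeatureUnion will fit into your pipeline.
For reference: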
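from sklearn.base import BaseEstimator, TransformerMixin

class get_Vendor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing to learn in this skeleton
    def transform(self, X, y=None):
        return  # build and return the vendor-name features (e.g. dummies) here

# tfidf is the text vectorizer for 'Extract'; it is not defined in this snippet
lr_tfidf = Pipeline([('features', FeatureUnion([('other', get_Vendor()),
                                                ('vect', tfidf)])),
                     ('clf', RandomForestClassifier())])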
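To make the skeleton above concrete, here is a minimal sketch of one way it could be filled in. It is not the only approach. The `ColumnSelector` helper and the toy rows are hypothetical and invented for illustration; only the three column names come from the question. The sketch assumes the pipeline is fed a DataFrame holding both 'Extract' and 'Vendor name', because every branch of a FeatureUnion receives the same input and has to pick out its own column. The dummies are produced with `OneHotEncoder` rather than `pandas.get_dummies` so that the vendor categories learned during `fit` are reused at prediction time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: hand one part of the DataFrame to a branch."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # A string key gives a 1-D Series, a list of keys gives a 2-D DataFrame.
        return X[self.column]


# Toy rows, made up for this example (not the asker's data).
data = pd.DataFrame({
    'Extract': ['office paper and pens', 'flight to london',
                'printer paper', 'train ticket to paris',
                'pens and staples', 'hotel in london',
                'toner for printer', 'taxi to airport'],
    'Vendor name': ['Staples', 'British Airways', 'Staples', 'Eurostar',
                    'Staples', 'Hilton', 'HP', 'Uber'],
    'Expense Account code Description': ['Office supplies', 'Travel',
                                         'Office supplies', 'Travel',
                                         'Office supplies', 'Travel',
                                         'Office supplies', 'Travel'],
})

features = FeatureUnion([
    # Branch 1: the original bag-of-words steps, applied to 'Extract' only.
    ('text', Pipeline([
        ('select', ColumnSelector('Extract')),
        ('vect', CountVectorizer(ngram_range=(1, 1))),
        ('tfidf', TfidfTransformer(use_idf=False)),
    ])),
    # Branch 2: one-hot ("dummy") encoding of the vendor name.
    ('vendor', Pipeline([
        ('select', ColumnSelector(['Vendor name'])),
        ('dummies', OneHotEncoder(handle_unknown='ignore')),
    ])),
])

clf = Pipeline([
    ('features', features),
    ('clf', RandomForestClassifier(n_estimators=100, max_features='log2',
                                   criterion='entropy', random_state=0)),
])

X = data[['Extract', 'Vendor name']]
y = data['Expense Account code Description']
clf.fit(X, y)
predictions = clf.predict(X)
```

With the real data, `train_test_split` would be called on `data[['Extract', 'Vendor name']]` instead of on `data['Extract']` alone, so that both columns reach the pipeline. `handle_unknown='ignore'` lets the encoder cope with vendor names that appear only in the test set.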