How can I use Ensemble learning of two models with different features as an input?
I have a fake-news detection problem that predicts the binary labels "1" and "0" by vectorizing the 'tweet' column. I use three different models for detection, and I want to use an ensemble method to improve accuracy, but the models use different vectorizers.
I have 3 KNN models; the first and the second vectorize the 'tweet' column using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_train = vector.fit_transform(X_train['tweet']).toarray()
X_test = vector.transform(X_test['tweet']).toarray()  # transform only, so the test set reuses the training vocabulary
For the third model I used fastText for sentence vectorization:
%%time
sent_vec = []
for index, row in X_train.iterrows():
    sent_vec.append(avg_feature_vector(row['tweet']))

%%time
sent_vec1 = []
for index, row in X_test.iterrows():
    sent_vec1.append(avg_feature_vector(row['tweet']))
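(`avg_feature_vector` is not shown here; roughly, it averages the fastText word vectors of a tweet. A minimal sketch, assuming the official fasttext Python package and a pre-trained model file, both of which are illustrative assumptions:)

import fasttext
import numpy as np

# Sketch of the averaging helper (not shown in the post); the model file name is illustrative.
ft_model = fasttext.load_model('cc.en.300.bin')

def avg_feature_vector(text, model=ft_model, dim=300):
    words = text.split()
    if not words:
        return np.zeros(dim)
    # average the word vectors of all tokens in the tweet
    return np.mean([model.get_word_vector(w) for w in words], axis=0)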
After scaling and ..., my third model fits the input like this:
scaler.fit(sent_vec)  # `scaler` is assumed to be e.g. sklearn's StandardScaler, created earlier
scaled_X_train = scaler.transform(sent_vec)
scaled_X_test = scaler.transform(sent_vec1)
.
.
.
knn_model1.fit(scaled_X_train, y_train)
Now I want to combine the three models like this, and I want the ensemble method to give me the majority vote, just like VotingClassifier, but I have no idea how I can deal with the different inputs (TF-IDF & fastText). Is there another way to do that?
You can create a custom MyVotingClassifier that takes already-fitted models instead of untrained model instances. sklearn's VotingClassifier only accepts unfitted classifiers as input, trains them itself, and then votes on their predictions. You can build something like the class below; it may not be exactly what you need, but you can write something very similar for your purpose.
from collections import Counter

# Fit each KNN on its own feature matrix (X1/X2: TF-IDF, X3: scaled fastText)
clf1 = knn_model_1.fit(X1, y)
clf2 = knn_model_2.fit(X2, y)
clf3 = knn_model_3.fit(X3, y)

class MyVotingClassifier:
    def __init__(self, **models):
        self.models = models

    def predict(self, dict_X):
        '''
        dict_X = {'knn_model_1': X1, 'knn_model_2': X2, 'knn_model_3': X3}
        '''
        preds = []
        for model_name in dict_X:
            model = self.models[model_name]
            preds.append(model.predict(dict_X[model_name]))
        preds = list(zip(*preds))  # one tuple of per-model predictions per sample
        final_pred = list(map(lambda x: Counter(x).most_common(1)[0][0], preds))  # majority vote
        return final_pred
ensemble_model = MyVotingClassifier(knn_model_1=clf1, knn_model_2=clf2, knn_model_3=clf3)
ensemble_model.predict({'knn_model_1': X1, 'knn_model_2': X2, 'knn_model_3': X3}) # Input the pre-processed `X`s
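At prediction time you only need to run each new batch of tweets through the same preprocessing each model was trained on and pass the results in dict_X. A rough sketch, assuming models 1 and 2 share the fitted TfidfVectorizer `vector` and model 3 uses `avg_feature_vector` plus the fitted `scaler` from the question (names taken from the code above; `new_tweets` is a hypothetical list of raw tweet strings):

import numpy as np

# Reuse the preprocessing each model was trained with (assumptions noted above)
X_tfidf = vector.transform(new_tweets).toarray()              # TF-IDF features for models 1 and 2
X_ft = np.array([avg_feature_vector(t) for t in new_tweets])  # averaged fastText vectors
X_ft_scaled = scaler.transform(X_ft)                          # scaled features for model 3

ensemble_model.predict({
    'knn_model_1': X_tfidf,
    'knn_model_2': X_tfidf,
    'knn_model_3': X_ft_scaled,
})

Each model only ever sees the representation it was trained on, so the TF-IDF and fastText features never have to be merged into a single matrix; the voting happens on the predicted labels.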