scikit-learn: How to include other features after fit and transform with TfidfVectorizer?

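My situation in brief: I have four input columns, id, text, category, and label.

I ran TfidfVectorizer on text, which gives me a list of instances whose word tokens carry TF-IDF scores.

Now I want to include category (without passing it through TF-IDF) as another feature in the vectorizer's output data.

Also note that the data has already gone through train_test_split before vectorization.

How can I do that?

Initial code: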

#initialization
import pandas as pd
path = r'data\data.csv'  # raw string so the backslash is not read as an escape
rappler = pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance

#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after (or even before) calling fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)

#actual classification
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))

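I would suggest doing the train/test split after extracting the features.

Once you have the TF-IDF feature matrix, just append the other features to each sample.

You will have to encode the category feature; sklearn's LabelEncoder is a good option. You should then have two sets of NumPy arrays that can be concatenated.

Here is a toy example: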

import numpy as np

X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)

From this point you proceed as before, starting with the train/test split.
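The toy example uses small dense arrays, whereas the vectorizer returns a SciPy sparse matrix. Here is a rough sketch of the same concatenation on real vectorizer output, assuming scipy.sparse.hstack is acceptable for the stacking step. The texts and categories lists are made-up stand-ins for the question's columns, not the actual data.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data for the text and category columns.
texts = ["the cat sat", "the dog barked", "a cat and a dog"]
categories = ["news", "sports", "news"]

# Sparse TF-IDF matrix of shape (n_samples, n_terms).
X_tfidf = TfidfVectorizer().fit_transform(texts)

# Integer-encode the category and reshape it into a single column.
cat_codes = LabelEncoder().fit_transform(categories).reshape(-1, 1)

# Stack the two blocks column-wise without densifying the TF-IDF matrix.
X = hstack([X_tfidf, csr_matrix(cat_codes)]).tocsr()
print(X.shape)
```

Note that integer codes imply an ordering between categories; if that is not wanted, OneHotEncoder could be swapped in for the encoding step and stacked the same way.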

You should use FeatureUnion, as explained in the documentation:

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.

Another good example of how to use FeatureUnion can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

Concatenating the matrices as @AlexG suggested may be the simpler option, but FeatureUnion is the scikit-learn way of doing this.
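For what it's worth, here is a minimal sketch of how a FeatureUnion could be wired up for this case. The small DataFrame and the select_text / select_category helpers are hypothetical, made up for illustration; one-hot encoding the category is my assumption, not something the question specifies.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical stand-in for the question's DataFrame.
df = pd.DataFrame({
    "text": ["good game tonight", "election results are in",
             "great match today", "vote count delayed"],
    "category": ["sports", "politics", "sports", "politics"],
    "label": [1, 0, 1, 0],
})

def select_text(frame):
    # 1-D series of raw strings, which is what TfidfVectorizer expects.
    return frame["text"]

def select_category(frame):
    # 2-D single-column frame, which is what OneHotEncoder expects.
    return frame[["category"]]

# Each branch picks its own column, transforms it, and the outputs
# are concatenated side by side by the FeatureUnion.
features = FeatureUnion([
    ("tfidf", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("vect", TfidfVectorizer()),
    ])),
    ("category", Pipeline([
        ("select", FunctionTransformer(select_category)),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

model = Pipeline([("features", features), ("nb", MultinomialNB())])

X = df[["text", "category"]]
model.fit(X, df["label"])
pred = model.predict(X)
```

Because the whole thing is a single estimator, it can be fit on the training split and applied to the test split as-is, so the data can stay split before vectorization as in the question.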