Are all the features correctly selected and used in a classifier?

I would like to know, when I use a classifier, for example:

random_forest_ngram = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
        ])

random_forest_ngram.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_ngram = random_forest_ngram.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_ngram == DataPrep.test_news['Label'])

I am also considering other features for the model. I define X and y as follows:

X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.25, random_state=40) 

df_train= pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)

My dataset looks like this:

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

I also want to use is_it_capital?, is_it_upper?, and contains_num? as features, but since they have binary values (1 or 0 after encoding), should I apply BoW only on Text to extract additional features? Maybe my question is obvious, but since I am a new ML learner and I am not familiar with classifiers and encoding, I would appreciate all the support and comments you can provide. Thank you.


You can certainly use your "extra" features such as is_it_capital?, is_it_upper?, and contains_num?. It seems you are struggling with how exactly to combine these two seemingly very different feature sets. You can use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each group of features. There is no reason you cannot use your extra features in combination with any text feature extraction method (e.g. your BoW approach).

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text',
             'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0],
})

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)

print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
 [0 0 0 1 1 0 1 1 1 1 0]
 [1 0 2 0 0 0 1 1 0 0 1]
 [0 1 0 0 0 1 0 0 0 0 0]]
print(transformer.get_feature_names())
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
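Since FeatureUnion was mentioned above as an alternative, here is a minimal sketch of the same idea using it. The toy data and column names below are assumptions for illustration; note that FeatureUnion feeds every branch the whole input, so each branch needs its own column selector (here via FunctionTransformer):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text'],
    'is_it_upper': [0, 1],
    'contains_num': [0, 0],
})

union = FeatureUnion([
    # BoW branch: select the text column, then vectorize it
    ('bow', Pipeline([
        ('select', FunctionTransformer(lambda d: d['text'])),
        ('vectorizer', CountVectorizer()),
    ])),
    # binary branch: pass the 0/1 columns through unchanged
    ('binary', FunctionTransformer(lambda d: d[['is_it_upper', 'contains_num']].values)),
])

X = union.fit_transform(df)
print(X.shape)  # (2, 7): 5 BoW terms + 2 binary columns
```

ColumnTransformer is usually the cleaner choice when your input is a DataFrame, since it handles column selection for you; FeatureUnion is more general but needs the explicit selectors shown here.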

More on your specific example:

X=df[['Text','is_it_capital?', 'is_it_upper?', 'contains_num?']]
y=df['Label']

# Need to use DenseTransformer to properly concatenate results
# from CountVectorizer and other transformer steps
from sklearn.base import TransformerMixin
class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        return X.todense()

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
     ('vectorizer', CountVectorizer()), 
     ('to_dense', DenseTransformer()), 
])

transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.25, random_state=40)

X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# ColumnTransformer returns plain arrays, so wrap them in DataFrames
# before concatenating with the label Series
df_train = pd.concat([pd.DataFrame(X_train), y_train.reset_index(drop=True)], axis=1)
df_test = pd.concat([pd.DataFrame(X_test), y_test.reset_index(drop=True)], axis=1)
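Put together, the DenseTransformer pattern above runs end to end like this; the toy data and column names here are assumptions, not from the original post:

```python
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        return X.toarray()  # densify the sparse BoW matrix

df = pd.DataFrame({'Text': ['cat dog bird', 'cat cat dog'],
                   'contains_num?': [0, 1]})

text_pipe = Pipeline([('vectorizer', CountVectorizer()),
                      ('to_dense', DenseTransformer())])
transformer = ColumnTransformer([('text', text_pipe, 'Text')],
                                remainder='passthrough')

X = transformer.fit_transform(df)
print(X.shape)  # (2, 4): 3 BoW terms + the passed-through binary column
```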

What I find helpful is doing the transformations in a way I fully control. For each group of columns I apply a specific transformation, and at the end I merge the transformations. Here is an example:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import RandomForestClassifier

# boolean
boolean_features = ['is_it_capital?', 'is_it_upper?','contains_num?',]
boolean_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
    ]
)

text_features = 'Text'
text_transformer = Pipeline(
    steps=[('vectorizer', CountVectorizer())]
)

# merge all pipelines

preprocessor = ColumnTransformer(
    transformers=[
        ('bool', boolean_transformer, boolean_features),
        ('text', text_transformer, text_features),
    ]
)

pipelines = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', RandomForestClassifier(n_estimators=300,n_jobs=3))
    ]
)

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42, stratify=y)


# we can train our model
pipelines.fit(X_train, y_train)
pipelines.score(X_test, y_test)

# what is awesome: using other tools like GridSearchCV becomes easy.

params = {'model__n_estimators': [100, 200, 300], 'model__criterion': ['gini', 'entropy']}

clf = GridSearchCV(
    pipelines, cv=5, n_jobs=-1, param_grid=params, scoring='roc_auc'
)

clf.fit(X_train, y_train)

# predict for totally unseen data
clf.predict(X_test)
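Note the model__ prefix in the grid: it routes each parameter to the step named 'model' inside the pipeline. A self-contained sketch of this routing and of inspecting the results (the synthetic data and small grid here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# toy classification data standing in for the real features
X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([('model', RandomForestClassifier(random_state=0))])

# '<step name>__<parameter>' addresses a parameter of that pipeline step
params = {'model__n_estimators': [10, 20]}
clf = GridSearchCV(pipe, param_grid=params, cv=3, scoring='roc_auc')
clf.fit(X, y)

print(clf.best_params_)  # the winning grid point
print(clf.best_score_)   # its mean cross-validated ROC AUC
```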

Update

If we have columns that do not need a transformation but should still be included, add remainder='passthrough':

# assumption: the code above no longer includes the boolean branch
# ...
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),

    ], remainder='passthrough'
)
#...
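A minimal runnable demonstration of this passthrough behavior (the toy data and column names are assumptions): untransformed columns are appended after the transformed ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Text': ['hello world', 'hello again'],
                   'is_it_upper?': [0, 1]})

preprocessor = ColumnTransformer(
    transformers=[('text', CountVectorizer(), 'Text')],
    remainder='passthrough',  # keep the untransformed columns
)

X = preprocessor.fit_transform(df)
print(X.shape)  # (2, 4): 3 BoW terms plus the passed-through column
```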

Check the scikit-learn documentation and usage examples: