将 sklearn 管道和交叉验证与二进制列相结合

Question

我想运行一个数据集的回归模型，其中包含一个文本列、五个二进制变量和一个数字目标变量。我包含了一个 CountVectorizer 来向量化文本列，并尝试使用 make_column_transformer 将它组合到 sklearn Pipeline 中。数据没有任何缺失值 - 但是，当运行执行以下脚本时，我收到以下警告：

FitFailedWarning: Estimator fit failed. The score on this train-test 
partition for these parameters will be set to nan.

和以下错误消息：

TypeError: All estimators should implement fit and transform, or can be 
'drop' or 'passthrough' specifiers. 'Level1' (type <class 'str'>) doesn't.

我假设问题可能是我没有指定第二个元组 make_column_transformer 但仅以下内容： sample_df[categorical_cols] 但我不确定如何包含一个已经 make_column_transformer.

中已处理、准备好的数据

完整代码：

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score


categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']

pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), textual_col), 
                                             sample_df[categorical_cols],
                                           remainder='passthrough')),
    ('model', RandomForestRegressor())
])

X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']

cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores

示例数据集：

import io

data_string = """
Level1;Level2;Level3;Level4;Level5;Text;Value
0;0;1;0;0;Are you sure that the input;109.3
0;0;0;0;0;that the input text data for;87.2
0;0;1;0;0;text data for your model is;21.5
0;0;0;0;0;your model is in English? Well,;143.5
0;0;0;0;1;in English? Well, no one can;141.1
0;0;0;0;0;no one can be sure about;93.4
0;0;0;0;0;be sure about this, as no;29.5
0;0;0;0;0;this, as no one will read;17.9
0;0;1;0;0;one will read around 20k records;37.8
0;0;1;0;0;around 20k records of text data.;153.7
0;0;0;0;0;of text data. So, how non-English;99.5
0;0;0;1;0;So, how non-English text will affect;119.1
0;0;0;0;1;text will affect your English text;97.5
0;0;0;0;0;your English text trained model? Pick;49.2
0;0;0;0;0;trained model? Pick any non-English text;79.3
0;0;0;0;0;any non-English text and pass it;107.7
0;1;0;0;1;and pass it through as input;117.3
0;0;0;0;0;through as input to your English;151.1
0;0;0;0;0;to your English text trained classification;47.3
0;0;0;0;0;text trained classification model. You will;129.3
0;0;0;0;0;model. You will come to know;135.1
0;0;0;0;0;come to know that the category;145.8
0;0;0;0;1;that the category is assigned to;131.9
1;0;0;1;0;is assigned to non-English text by;43.7
1;0;0;0;0;non-English text by the model. If;67.1
1;0;0;0;0;the model. If your model is;105.3
0;0;0;1;0;your model is dependent on one;65.2
0;1;0;0;0;dependent on one language then, other;98.3
0;0;0;0;0;language then, other languages in your;130.5
0;0;0;0;0;languages in your textual data should;107.2
0;1;1;0;0;textual data should be considered as;66.5
0;0;0;1;0;be considered as noise. But why?;43.1
0;0;0;0;1;noise. But why? The job of;56.7
0;0;0;0;0;The job of the text classification;75.1
1;0;0;0;0;the text classification model is to;88.3
1;0;0;0;0;model is to classify. And, it;91.3
0;0;0;0;0;classify. And, it will do its;106.4
1;0;0;0;0;will do its job despite its;109.5
0;0;0;0;1;job despite its input text will;143.1
0;0;0;0;0;input text will be in English;54.1
1;0;0;0;0;be in English or not. What;96.4
0;0;0;1;0;or not. What can we do;133.8
0;0;0;0;0;can we do to avoid such;146.4
0;0;1;0;0;to avoid such a situation? Your;164.3
0;0;1;0;0;a situation? Your model will not;34.6
0;0;0;0;0;model will not stop classifying the;76.8
0;0;0;1;0;stop classifying the non-English text. So,;80.5
0;0;1;0;0;non-English text. So, you have to;90.3
0;0;0;0;0;you have to detect the non-English;68.3
0;0;0;0;0;detect the non-English text and remove;44.0
0;0;1;0;0;text and remove it from trained;100.4
0;0;0;0;0;it from trained data and prediction;117.4
0;0;0;0;1;data and prediction data. This process;85.4
0;1;0;0;0;data. This process comes under the;65.7
0;0;1;0;0;comes under the data cleaning part.;54.3
0;1;0;0;0;data cleaning part. Inconsistency in your;78.9
0;0;0;0;0;Inconsistency in your data will result;96.8
1;0;0;0;1;data will result in a decrease;108.1
0;0;0;0;0;in a decrease in the accuracy;145.7
1;0;0;0;0;in the accuracy of the model.;103.6
0;0;1;0;0;of the model. Sometimes, multiple languages;56.4
0;0;0;0;1;Sometimes, multiple languages present in text;90.5
0;0;0;0;0;present in text data could be;80.4
0;0;0;0;0;data could be one of the;90.7
1;0;0;0;0;one of the reasons your model;48.8
0;0;0;0;0;reasons your model behaves strangely. So,;65.4
0;0;1;0;0;behaves strangely. So, in this article,;107.5
0;0;0;0;0;in this article, we will discuss;143.2
0;0;0;0;0;we will discuss the different python;165.0
0;0;0;0;0;the different python libraries which detect;123.3
0;0;0;0;1;libraries which detect the language(s) of;85.3
0;0;0;0;0;the language(s) of the text data.;91.4
0;0;0;0;1;the text data. Let’s start with;49.5
0;0;0;0;0;Let’s start with the spaCy library.;76.3
0;0;0;0;0;the spaCy library.;49.5
"""

sample_df = pd.read_csv(io.StringIO(data_string), sep=';')

Answer 1

您可以使用 remainder='passthrough' 来避免转换已处理的列（因此在您的情况下，您可以将二进制列视为 residual 列，您的 ColumnTransformer 对象不会处理，但它将通过 )。那么您应该知道 CountVectorizer 需要一个一维数组作为输入，因此您应该将要传递给 make_column_transformer 的列指定为字符串 ('Text')，而不是数组 ( ['Text'])（参见 make_column_transformer() doc 中的参考资料）。

columns : str, array-like of str, int, array-like of int, slice, array-like of bool or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score

categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), 'Text'), 
                                             remainder='passthrough')),
    ('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
scores

将 sklearn 管道和交叉验证与二进制列相结合

Combining sklearn pipeline and cross validation with binary columns

python

regression

scikit-learn