Process different columns with different preprocessing steps

I have the following DataFrame:

         text     count     daytime        label
   I think...        4      morning          pos
You should...        3    afternoon          neg
    Better...        7      evening          neu

I tried using ColumnTransformer to preprocess only the text column:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text')
], remainder='passthrough')

That works well. Then I wanted to apply separate transformers to the count and daytime columns, using the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    ('scaler', StandardScaler(), 'count'),
    ('enc', OneHotEncoder(), 'daytime')
], remainder='passthrough')

X_transformed = transformer.fit_transform(X)

It gives me this error:

1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.

I think the problem is StandardScaler, which is only being passed 1D data. How can I fix this?

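You have to separate each tuple in the list with a comma. Also, StandardScaler and OneHotEncoder expect 2D input, so, as the error message suggests, you should pass the column selector for these transformers as a list of one item rather than a scalar string. TfidfVectorizer, by contrast, expects a 1D sequence of strings, so 'text' stays a scalar: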

transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    ('scaler', StandardScaler(), ['count']),
    ('enc', OneHotEncoder(), ['daytime'])
], remainder='passthrough')
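
To see why the list matters, here is a minimal, self-contained sketch. It assumes a toy DataFrame shaped like the one in the question; the full sentences in the text column are made up for illustration, since the original values are truncated. Selecting a column with a scalar yields a 1D Series, while selecting with a one-item list yields a 2D DataFrame, which is what StandardScaler and OneHotEncoder need.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data modeled on the question's table; the sentences are invented.
df = pd.DataFrame({
    "text": ["I think this is good", "You should not do that", "Better than nothing"],
    "count": [4, 3, 7],
    "daytime": ["morning", "afternoon", "evening"],
    "label": ["pos", "neg", "neu"],
})
X = df.drop(columns="label")

# Scalar selector -> 1D Series; one-item list selector -> 2D DataFrame.
print(X["count"].shape)    # (3,)
print(X[["count"]].shape)  # (3, 1)

transformer = ColumnTransformer([
    # TfidfVectorizer wants a 1D sequence of strings, so use a scalar selector.
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 1)), "text"),
    # These two expect 2D input, so use one-item list selectors.
    ("scaler", StandardScaler(), ["count"]),
    ("enc", OneHotEncoder(), ["daytime"]),
], remainder="passthrough")

X_transformed = transformer.fit_transform(X)
n_rows, n_cols = X_transformed.shape
print(n_rows, n_cols)
```

The number of output columns is the TF-IDF vocabulary size plus one scaled column plus one column per daytime category. Depending on how sparse the combined result is, fit_transform may return either a SciPy sparse matrix or a dense NumPy array; both expose .shape.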