Process different columns with different preprocessing steps
I have the following df:
text           count  daytime    label
I think...     4      morning    pos
You should...  3      afternoon  neg
Better...      7      evening    neu
I tried using ColumnTransformer to preprocess only the text column:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text')
], remainder='passthrough')
It works well. Then I wanted to also preprocess count and daytime, each with its own transformer, using the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    ('scaler', StandardScaler(), 'count'),
    ('enc', OneHotEncoder(), 'daytime')
], remainder='passthrough')
X_transformed = transformer.fit_transform(X)
It gives me this error:
1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.
I think the problem is with StandardScaler, which is only being passed 1D data. How can I fix this?
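You have to separate each tuple in the list of tuples with a comma. Also, since StandardScaler and OneHotEncoder expect 2D input, you should do what the error message suggests and pass the column selector for those transformers as a list of one item: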
transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    ('scaler', StandardScaler(), ['count']),
    ('enc', OneHotEncoder(), ['daytime'])
], remainder='passthrough')
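To see why the list form matters, here is a minimal end-to-end sketch. The DataFrame is a made-up stand-in for the df in the question (the full text values are invented, and dropping label to get X is an assumption about how X was built). It shows that a scalar selector gives a 1D Series while a one-item list gives a 2D DataFrame. TfidfVectorizer is the exception: it expects a 1D sequence of strings, so 'text' stays a scalar.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data modeled on the df in the question (text values are invented).
df = pd.DataFrame({
    "text": ["I think it works", "You should try it", "Better late than never"],
    "count": [4, 3, 7],
    "daytime": ["morning", "afternoon", "evening"],
    "label": ["pos", "neg", "neu"],
})
X = df.drop(columns="label")  # assumption: label is the target, not a feature

# A scalar selector yields a 1D Series; a one-item list yields a 2D DataFrame.
print(X["count"].shape)    # (3,)
print(X[["count"]].shape)  # (3, 1)

transformer = ColumnTransformer([
    # TfidfVectorizer wants a 1D sequence of strings, so a scalar selector is right here.
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 1)), "text"),
    # StandardScaler and OneHotEncoder want 2D input, so select with a list.
    ("scaler", StandardScaler(), ["count"]),
    ("enc", OneHotEncoder(), ["daytime"]),
], remainder="passthrough")

X_transformed = transformer.fit_transform(X)
print(X_transformed.shape)  # one row per sample; columns = tf-idf terms + 1 scaled + one-hot categories
```

The same rule applies in the other direction: passing ['text'] to TfidfVectorizer would hand it a 2D DataFrame, which it does not handle as intended, so keep the scalar there.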