Transformers for mixed data types
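I am having trouble applying different transformers to different types of columns (text vs. numeric) in one go, and joining those transformers into a single one for later use.
I tried following the steps in the Column Transformer with Mixed Types documentation, which explains how to do this for a mix of categorical and numeric data, but it does not seem to work with text data.
TL;DR
How do you create a storable transformer that applies different pipelines to text and numeric data?
Data download and preparation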
# imports
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
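Transforming numerical features: OK
Following the steps in the documentation linked above, a transformer for the numerical features can be created like this: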
# handling missing data and normalization
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])
# this works
num_preprocessor.fit(X_train)
train_feature_set = num_preprocessor.transform(X_train)
test_feature_set = num_preprocessor.transform(X_test)
# verify shape = (number of data points, number of numerical features (2) )
train_feature_set.shape # (1047, 2)
test_feature_set.shape # (262, 2)
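Transforming text features: OK
To process the text features, I vectorize each text column separately with TF-IDF (instead of concatenating all text columns and applying TF-IDF only once):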
# Tfidf of max 30 features
text_transformer = TfidfVectorizer(use_idf=True,
max_features=30)
# apply separately to each column
text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
text_preprocessor = ColumnTransformer(transformers=text_transformer_list)
# this works
text_preprocessor.fit(X_train)
train_feature_set = text_preprocessor.transform(X_train)
test_feature_set = text_preprocessor.transform(X_test)
# verify shape = (number of data points, number of text features (3) times max_features(30) )
train_feature_set.shape # (1047, 90)
test_feature_set.shape # (262, 90)
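How do you do both at once?
I tried various strategies to combine the two procedures above into a single transformer, but they all fail with different errors.
Attempt 1: follow the documented strategy
Following the documentation (Column Transformer with Mixed Types) does not work once the categorical data is replaced with text data: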
# documented strategy
sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
('text', text_transformer, text_features)])
# fails
sum_preprocessor.fit(X_train)
This returns the following error message:
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3
Attempt 2: FeatureUnion on the lists of transformers
# create a list of numerical transformer, like those for text
numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]
# fails
column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])
This returns the following error message:
TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't
Attempt 3: ColumnTransformer on the lists of transformers
# create a list of all transformers, text and numerical
sum_transformer_list = text_transformer_list + numerical_transformer_list
# works
sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)
# fails
sum_preprocessor.fit(X_train)
This returns the following error message:
ValueError: Expected 2D array, got 1D array instead:
array=[54. nan nan ... 20. nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
My question
How do I create a single object that can fit and transform data mixing text and numerical types?
Short answer:
all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]
all_preprocessor = ColumnTransformer(transformers=all_transformers)
all_preprocessor.fit(X_train)
train_all = all_preprocessor.transform(X_train)
test_all = all_preprocessor.transform(X_test)
print(train_all.shape, test_all.shape)
# prints (1047, 92) (262, 92)
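The difficulty here is that (most?) text transformers expect one-dimensional input, whereas (most?) numeric transformers expect two-dimensional input. ColumnTransformer handles this by letting you specify either a single column or a list of columns: in the first case a 1D array is passed to the transformer, in the second a 2D array.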
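To see the 1D vs. 2D behaviour directly, here is a minimal sketch on a made-up three-row frame. `ShapeRecorder` is a hypothetical helper written only for this illustration (it is not part of scikit-learn); it simply remembers how many dimensions its input had.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class ShapeRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: remembers the number of dimensions of its input."""

    def fit(self, X, y=None):
        self.input_ndim_ = np.ndim(X)
        return self

    def transform(self, X):
        # Always return a 2D column so ColumnTransformer can stack the results.
        return np.zeros((len(X), 1))


# Toy data, invented for the example.
df = pd.DataFrame({"name": ["a b", "c d", "e f"], "age": [1.0, 2.0, 3.0]})

ct = ColumnTransformer([
    ("scalar_spec", ShapeRecorder(), "name"),   # column given as a string
    ("list_spec", ShapeRecorder(), ["name"]),   # column given as a list
])
ct.fit(df)

ndim_scalar = ct.named_transformers_["scalar_spec"].input_ndim_
ndim_list = ct.named_transformers_["list_spec"].input_ndim_
print(ndim_scalar, ndim_list)  # prints 1 2
```

So the only difference between `"name"` and `["name"]` in the column spec is the dimensionality of what the transformer receives, which is exactly what the vectorizer and the scaler disagree about.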
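So, to explain the errors in the three attempts:
Attempt 1: TF-IDF receives a 2D array and treats the columns as documents rather than the individual entries, so it produces only three outputs. When ColumnTransformer tries to concatenate that with the 1047 rows of numeric output, it fails.
Attempt 2: FeatureUnion does not take the same input format as ColumnTransformer: here you should not use triples (name, transformer, columns). In any case, FeatureUnion is not suited to what you are doing here.
Attempt 3: This time you are sending 1D data to the numeric transformers, but those need 2D data.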
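The causes of attempts 1 and 3 can be reproduced outside of ColumnTransformer. This is a rough sketch on a five-row toy frame (the values are made up, not taken from the OpenML download), calling the transformers directly on a 2D and a 1D selection. One detail worth noting: with a pandas DataFrame, iteration yields the column labels, which is why the vectorizer ends up with one "document" per column.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Titanic frame (values invented for illustration).
toy = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Allison, Master Hudson",
             "Andrews, Mr Thomas", "Astor, Col John", "Brown, Mrs James"],
    "cabin": ["B5", "C22 C26", "A36", "C62 C64", "B4"],
    "home.dest": ["St Louis, MO", "Montreal, PQ", "Belfast, NI",
                  "New York, NY", "Denver, CO"],
    "age": [29.0, 1.0, 39.0, 47.0, 44.0],
})
text_cols = ["name", "cabin", "home.dest"]

# Attempt 1: a 2D selection. Iterating a DataFrame yields its column labels,
# so the vectorizer sees 3 "documents" instead of 5 rows.
rows_2d = TfidfVectorizer().fit_transform(toy[text_cols]).shape[0]

# A single column (1D Series) gives one output row per input row.
rows_1d = TfidfVectorizer().fit_transform(toy["name"]).shape[0]

# Attempt 3: a 1D selection handed to a numeric transformer is rejected.
try:
    StandardScaler().fit(toy["age"])
    scaler_accepts_1d = True
except ValueError:
    scaler_accepts_1d = False

# The 2D selection of the same column works.
scaled = StandardScaler().fit_transform(toy[["age"]])

print(rows_2d, rows_1d, scaler_accepts_1d, scaled.shape)
# prints 3 5 False (5, 1)
```

That is why the short answer gives each text column as a bare string and the numeric columns as a list.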