RandomForestClassifier is throwing error: one field contains comma-separated values
I am trying to fit a RandomForestClassifier, like this.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(col_trans, rf_classifier)
pipe.fit(X_train, y_train)
I get this error:
ValueError: Found unknown categories ['4G, 4G LAA, 5G NR', '4G,4G CBRS,5G FIXED'] in column 3 during transform
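The field named 'technology_type' contains comma-separated values, like this: 4G, 5G, NR
How do I handle these comma-separated values? I suppose I could drop the field, but I would really like to keep it in X as an independent variable.
Here is my full code.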
df_fuze = pd.read_sql("""select * from fuze""", conn_connection)
# copy features to new DF
fuze = df_fuze[['territory',
'submarket',
'local_market',
'technology_type',
'project_type',
'modification_type',
'objective',
'construction_completed_days']]
fuze.head()
# set dependent variable
y = fuze['construction_completed_days']
# set the independent variables
X = fuze.drop('construction_completed_days', axis=1)
seed = 50 # so that the result is reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.333, random_state = seed)
X_train = X_train.fillna('na')
X_test = X_test.fillna('na')
features_to_encode = list(X_train.select_dtypes(include = ['object']).columns)
# Or alternatively,
# features_to_encode = X_train.columns[X_train.dtypes==object].tolist()
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
col_trans = make_column_transformer(
(OneHotEncoder(),features_to_encode),
remainder = "passthrough"
)
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(
min_samples_leaf=50,
n_estimators=150,
bootstrap=True,
oob_score=True,
n_jobs=-1,
random_state=seed,
max_features='sqrt')  # 'auto' meant 'sqrt' for classifiers; it was removed in scikit-learn 1.3
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(col_trans, rf_classifier)
pipe.fit(X_train, y_train)
The error occurs when I try to fit the X and y variables.
I am following the example here.
https://towardsdatascience.com/my-random-forest-classifier-cheat-sheet-in-python-fedb84f8cf4f
Suppose you have this dataset:
import pandas as pd
data = pd.DataFrame({'product_code': ['1', '2', '3', '4'],
'technology_type': ['4G, 4G LAA, 5G NR',
'4G,4G CBRS,5G FIXED',
'4G, 5G, NR',
'4G, NR']},
columns=['product_code', 'technology_type'])
Output:
product_code technology_type
1 4G, 4G LAA, 5G NR
2 4G,4G CBRS,5G FIXED
3 4G, 5G, NR
4 4G, NR
First, reshape the data so that each row holds a single technology_type category.
cleaned = data.set_index('product_code').technology_type.str.split(',', expand=True).stack()
Output:
product_code
1 0 4G
1 4G LAA
2 5G NR
2 0 4G
1 4G CBRS
2 5G FIXED
3 0 4G
1 5G
2 NR
4 0 4G
1 NR
Then you can apply get_dummies() and merge the result back into your data.
technology_type_dummies = pd.get_dummies(cleaned).groupby(level=0).sum()
newData = data.merge(technology_type_dummies, left_on='product_code', right_index=True)
Output:
product_code  technology_type      4G LAA  5G  5G NR  NR  4G  4G CBRS  5G FIXED
1             4G, 4G LAA, 5G NR         1   0      1   0   1        0         0
2             4G,4G CBRS,5G FIXED       0   0      0   0   1        1         1
3             4G, 5G, NR                0   1      0   1   1        0         0
4             4G, NR                    0   0      0   1   1        0         0
Just in case, remember to strip the leading and trailing whitespace from the column names.
newData.columns = newData.columns.str.strip()
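Stripping the column names after the fact works for this sample, but if the same token ever appears both with and without a leading space (for example ' NR' and 'NR'), the merge leaves two columns that end up with the same name. A minimal sketch of one way around that, assuming the only inconsistency is whitespace around the commas, is to normalize the strings first and let pandas' `Series.str.get_dummies` build the indicator columns. The names `normalized`, `dummies`, and `new_data` are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    'product_code': ['1', '2', '3', '4'],
    'technology_type': ['4G, 4G LAA, 5G NR',
                        '4G,4G CBRS,5G FIXED',
                        '4G, 5G, NR',
                        '4G, NR'],
})

# Remove whitespace at both ends and around every comma,
# so that ' NR' and 'NR' become the same token.
normalized = (data['technology_type']
              .str.strip()
              .str.replace(r'\s*,\s*', ',', regex=True))

# Build one 0/1 indicator column per distinct token.
dummies = normalized.str.get_dummies(sep=',')

# Both frames share the default integer index, so a plain join lines them up.
new_data = data.join(dummies)
print(new_data)
```

Because the tokens are cleaned before encoding, no separate pass over the column names should be needed here.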
Then you can drop the technology_type column. The dummy columns have an integer dtype, so they will not appear in features_to_encode in your code.
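The remaining string columns can still hit the same "Found unknown categories" error whenever the data being transformed contains a value that was not present when the encoder was fitted. A hedged sketch of one option is to pass `handle_unknown='ignore'` to `OneHotEncoder`, which encodes such values as all zeros instead of raising. The toy frames below are assumptions standing in for `X_train` and `X_test` after `technology_type` has been expanded and dropped; they are not the asker's data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for the real train/test splits.
X_train = pd.DataFrame({'project_type': ['new', 'mod', 'new'],
                        '4G': [1, 1, 0],
                        'NR': [0, 1, 1]})
X_test = pd.DataFrame({'project_type': ['new', 'upgrade'],  # 'upgrade' was never seen in training
                       '4G': [1, 0],
                       'NR': [1, 1]})

# Only the non-numeric columns need one-hot encoding; the dummy columns are already numeric.
features_to_encode = list(X_train.select_dtypes(exclude='number').columns)

col_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), features_to_encode),
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array, which is easier to inspect
)

col_trans.fit(X_train)
encoded = col_trans.transform(X_test)  # does not raise on 'upgrade'
print(encoded)
```

Ignoring unknown categories hides them from the model, so this is a trade-off and not a replacement for cleaning the field itself.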