Unable to predict using XGBoost
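I have a program that uses XGBoost for binary classification. Most of the code is done, but at the end I want to predict the class from user-defined variables, and that is where I'm stuck. Before sharing the code: the variable 'clf' is the best classifier I selected after running GridSearchCV: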
def prob1(LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, BILL_AMT3,
          BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6):
    #1) Store user entered information into a series, convert to dataframe, then transpose so that it is all in 1 row just like in training set.
    lst = [LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, BILL_AMT3,
           BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6]
    ud_df = pd.Series(lst)
    ud_df = ud_df.to_frame()
    ud_df = ud_df.T
    #2) Perform the same normalization and factorization of the values as done when loading the data in above.
    c = [1,2,3] # index of categorical data columns
    r = list(range(0,23))
    r = [x for x in r if x not in c] # get list of all other columns
    df_cat = ud_df.iloc[:, [2,3,4]].copy()
    df_con = ud_df.iloc[:, r].copy()
    # factorize categorical data
    for c in df_cat:
        df_cat[c] = pd.factorize(df_cat[c])[0]
    # scale continuous data
    scaler = preprocessing.MinMaxScaler()
    df_scaled = scaler.fit_transform(df_con)
    df_scaled = pd.DataFrame(df_scaled, columns=df_con.columns)
    df_final = pd.concat([df_cat, df_scaled], axis=1)
    #reorder columns back to original order
    cols = df.columns
    df_final = df_final[cols]
    #Predict
    prediction = clf.predict(df_final)
    #Predict Probability
    probability_pred = clf.predict_probab(df_final)
    return(prediction, probability_pred)
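What the function does: the user supplies these variables, the continuous ones are normalized, and the categorical ones are factorized.
When I run this code, I get this error: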
prob1(50000,1, 1, 1, 37,0,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800)
Offending line: df_con = ud_df.iloc[:, r].copy()
IndexError: positional indexers are out-of-bounds
Any help would be great!
Below is an example of what a single row looks like, as plain values without the argument names:
[50000,1,1,2,37,0,0,0,0,0,0,64400,57069,57608,19394,
19619,20024,2500,1815,657,1000,1000,800]
Edit 1: Fixed the bounds in the original code. I now get this error, with the prob1(.....) line highlighted:
KeyError: "Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',\n 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',\n 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',\n 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],\n dtype='object') not in index"
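Your list variable has 23 elements, but r = list(range(0,24)) has 24 elements: r = {0, 1, ..., 23}. When you use iloc to look up columns of ud_df by position, the frame only has 23 of them (positions 0 through 22), so there is no column at position 23. The lookup goes out of bounds, which is exactly what the error message says.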
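The bounds problem can be reproduced in isolation. The sketch below is only an illustration built from the example row in the question: `continuous_positions` is a hypothetical helper, not part of the original code, and `feature_names` is simply copied from the `prob1` signature. The second half rests on an assumption about the KeyError in Edit 1: since `ud_df` is built from a plain Series, its columns are the integers 0..22, while `df.columns` holds string names (including 'ID'), so a label-based reorder cannot find any of them.

```python
import pandas as pd

# The 23 example values from the question, in argument order.
values = [50000, 1, 1, 2, 37, 0, 0, 0, 0, 0, 0, 64400, 57069, 57608,
          19394, 19619, 20024, 2500, 1815, 657, 1000, 1000, 800]

# Names copied from the prob1 signature; note there is no 'ID' entry.
feature_names = ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE",
                 "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
                 "BILL_AMT1", "BILL_AMT2", "BILL_AMT3",
                 "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
                 "PAY_AMT1", "PAY_AMT2", "PAY_AMT3",
                 "PAY_AMT4", "PAY_AMT5", "PAY_AMT6"]

# Same construction as in the question: one row, columns labelled 0..22.
ud_df = pd.Series(values).to_frame().T


def continuous_positions(n_positions, categorical=(1, 2, 3)):
    """Hypothetical helper: all positions below n_positions except the categorical ones."""
    return [i for i in range(n_positions) if i not in categorical]


# range(0, 24) yields 0..23, but the last valid position is 22.
try:
    ud_df.iloc[:, continuous_positions(24)]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

# Deriving the bound from the frame itself keeps the two in sync.
df_con = ud_df.iloc[:, continuous_positions(ud_df.shape[1])]

# Assumed cause of the Edit 1 KeyError: integer column labels cannot be
# selected by name. Building the row with explicit names avoids that.
named = pd.DataFrame([values], columns=feature_names)
reordered = named[feature_names]  # label-based reorder now works
```

If this reading of Edit 1 is right, the reorder step would also need a column list without 'ID', because the user-supplied row has no such column. Two smaller things in the question's code may be worth checking as well: `c = [1,2,3]` is used to exclude the categorical columns, but `df_cat` is taken from positions `[2,3,4]`, so the two selections do not line up; and the scikit-learn style method for class probabilities is `predict_proba`, not `predict_probab`.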