ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input
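I am trying to reproduce this GitHub project on Topological Data Analysis (TDA) on my machine.
My steps:
- Get the best parameters from the cross-validation output
- Load my dataset (feature selection)
- Extract the topological features from the dataset for prediction
- Create a Random Forest Classifier model based on the best parameters
- Compute the probabilities on the test data
Background:
- Feature selection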
In order to decide which attributes belong to which group, we created a correlation matrix.
From this, we saw that there were two big groups, where player attributes were strongly correlated with each other. Therefore, we decided to split the attributes into two groups,
one to summarise the attacking characteristics of a player while the other one the defensiveness. Finally, since the goalkeeper has completely different statistics with respect to the
other players, we decided to take into account only the overall rating. Below, is possible
to see the 24 features used for each player:
Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing",
"reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration",
"sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle",
"long_passing"
Goalkeeper: "overall_rating"
From this set of features, the next step we did was to, for each non-goalkeeper player,
compute the mean of the attack attributes and the defensive ones.
Finally, for each team in a given match, we compute the mean and the standard deviation
for the attack and the defense from these stats of the team's players, as well as the best
attack and best defense.
In this way a match is described by 14 features (GK overall value, best attack, std attack,
mean attack, best defense, std defense, and mean defense, for each of the two teams), which map
the match into a space that follows the characteristics of the two teams.
- Feature extraction
The aim of TDA is to catch the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for
this kind of correlation.
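The aggregation described under feature selection can be sketched as follows. This is a minimal illustration and not the project's code: the helper name `team_features`, the toy ratings, the column order, and the use of the population standard deviation are all assumptions.

```python
import numpy as np

def team_features(attack, defense, gk_rating):
    """Summarise one team from per-player scores (hypothetical helper).

    attack, defense: one value per outfield player, each already averaged
    over that player's attack or defense attributes.
    gk_rating: the goalkeeper's overall rating.
    Returns 7 values: best/mean/std attack, best/mean/std defense, GK rating.
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        attack.max(), attack.mean(), attack.std(),
        defense.max(), defense.mean(), defense.std(),
        float(gk_rating),
    ])

# Toy numbers, made up for illustration only.
home = team_features([70, 80, 90], [60, 65, 70], gk_rating=78)
away = team_features([75, 75, 75], [50, 70, 90], gk_rating=81)

# One match = home features followed by away features: 7 + 7 = 14 columns.
match_row = np.concatenate([home, away])
print(match_row.shape)
```

The point of the sketch is only the count: 7 summary values per team give the 14 non-topological columns of a match.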
Methods:
def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output
    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]
    return x_train_with_topo, y_train

def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST', y_test.shape)
    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)
    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        print(range(0, len(x_test), shift))
        if i + shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams

    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df
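The batching in extract_features_for_prediction is easier to follow with the Giotto parts removed. The toy sketch below keeps only the bookkeeping: every batch is computed on the train rows plus the test rows seen so far, and only the last rows of each batch are kept. The stand-in `fake_diagram_features` is hypothetical and replaces both the pipeline and the Amplitude step; it is an assumption made purely for illustration.

```python
import numpy as np

def fake_diagram_features(points, n_topo=5):
    """Hypothetical stand-in for pipeline + Amplitude: one row per point."""
    # Each "feature" is just the row sum repeated n_topo times; the real code
    # would compute persistence-diagram amplitudes here.
    return np.repeat(points.sum(axis=1, keepdims=True), n_topo, axis=1)

def batched_test_features(x_train, x_test, batch_size=10):
    seen = x_train
    chunks = []
    for start in range(0, len(x_test), batch_size):
        new_rows = x_test[start:start + batch_size]  # the last batch may be shorter
        batch = np.concatenate([seen, new_rows])      # requires equal column counts
        feats = fake_diagram_features(batch)
        chunks.append(feats[-len(new_rows):])         # keep the test rows only
        seen = batch                                  # test rows join the pool
    topo = np.concatenate(chunks, axis=0)
    return np.concatenate([x_test, topo], axis=1)

x_train = np.zeros((30, 14))
x_test = np.ones((25, 14))
out = batched_test_features(x_train, x_test)
print(out.shape)  # (25, 19)
```

With a 14-column input and 5 extra columns the result has 19 columns, which is the width the trained model expects.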
Working code:
best_pipeline_params, best_model_feat_params = get_best_params()
# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}
pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
# SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])
x_train, y_train = load_dataset()
# x_train.shape -> (2565, 19)
# y_train.shape -> (2565,)
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)
# x_test.shape -> (380, 24)
rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids) # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')
But I get the error message:
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
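A small guard run before predict_proba makes this kind of mismatch visible earlier and reports both widths. This is only a sketch: `assert_same_n_features` is a hypothetical helper, and the zero-filled arrays merely reproduce the shapes reported above.

```python
import numpy as np

def assert_same_n_features(x_train, x_test):
    """Raise a descriptive error if the column counts differ (hypothetical helper)."""
    n_train, n_test = x_train.shape[1], x_test.shape[1]
    if n_train != n_test:
        raise ValueError(
            f"x_train has {n_train} columns but x_test has {n_test}; "
            f"difference: {n_test - n_train}"
        )

# Placeholder arrays with the shapes from the question.
x_train = np.zeros((2565, 19))
x_test = np.zeros((380, 24))

try:
    assert_same_n_features(x_train, x_test)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Fitted scikit-learn estimators also expose `n_features_in_`, which can be compared against `x_test.shape[1]` in the same way.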
Loaded dataset (X_train):
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64
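Note that the first 14 columns are the features describing the match, and the remaining 5 features (excluding the label) are the topological features, which have already been extracted.
The problem seems to be that when the code reaches extract_x_test_features() and extract_features_for_prediction(), it is supposed to compute the topological features and stack them onto the dataset.
Since X_train already has the topological features, it adds 5 more, so I end up with 24 features.
I am not sure, though. I am just trying to wrap my head around this project... and how prediction is done here.
How do I fix the mismatch using the code above?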
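One way to test this hypothesis is to look at the column counts the posted code implies. Inside extract_features_for_prediction, the train rows (sliced to 14 columns) and the test rows are concatenated row-wise, which NumPy only allows when both have the same number of columns. So if that call succeeds, the test input has 14 columns, and the remaining 24 - 14 = 10 columns come from the topological block rather than from X_train. The sketch below reproduces only this arithmetic with zero-filled arrays. That each of the 5 metrics contributes 2 columns (for example one per homology dimension) is an assumption that would need to be checked against the installed Giotto version.

```python
import numpy as np

x_train_no_topo = np.zeros((2565, 14))

# Row-wise concatenation fails if the test rows had 19 columns...
try:
    np.concatenate([x_train_no_topo, np.zeros((10, 19))])
    concat_19_ok = True
except ValueError:
    concat_19_ok = False

# ...so the test rows must have 14 columns for the loop to run at all.
x_test = np.zeros((380, 14))
batch = np.concatenate([x_train_no_topo, x_test[:10]])

# ASSUMPTION: each of the 5 metrics yields 2 columns instead of 1.
n_metrics, cols_per_metric = 5, 2
topo = np.zeros((len(x_test), n_metrics * cols_per_metric))

final_x_test = np.concatenate([x_test, topo], axis=1)
print(concat_19_ok, batch.shape, final_x_test.shape)
```

Printing `new_features_batch.shape` inside the real loop would confirm or refute the assumption about the width of the topological block.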
Notes:
1 - x_train and y_test are not dataframes but numpy.ndarray
2 - This problem is fully reproducible if you clone or download the project from the following link:
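The answer is actually already given in the question.
You mention # x_test.shape -> (380, 24) and # x_train.shape -> (2565, 19) in your question. The shape of your test data clearly does not match that of your training data: the training data has 19 features while the test data has 24 (they must contain the same number of features). So when you pass x_test to the model on this line, you get the error "X has 24 features, but DecisionTreeClassifier is expecting 19 features as input":
- get_probabilities(rf_model, x_test, team_ids)
Therefore, your test data must have 19 features, just like your training data.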
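In your x_train you have 19 features, while in X_test you have 24 features? Why is that?
To solve this, display both data frames (x_train and X_test) and try to find out why they have different features. In the end, both must have the same number of columns and the same features; otherwise you will get this error.
It may be that the dataset you imported is wrong.
Here is how to use RandomizedSearchCV to find the best parameters for your model: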
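from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline2 = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=62, max_depth=16)),
])
# cycle through your pickle file parameter combinations here:
# ('auto' is no longer accepted for max_features in recent scikit-learn releases,
# so 'sqrt' and 'log2' are searched instead)
param_grid = {
    'n_estimators': list(range(30, 100)),
    'max_depth': list(range(5, 26)),
    'max_features': ['sqrt', 'log2'],
}
# Only the classifier step is searched, so the scaler above is not applied.
random_rf_class = RandomizedSearchCV(
    estimator=pipeline2['clf'],
    param_distributions=param_grid,
    n_iter=10,
    scoring='accuracy', n_jobs=2, cv=10, refit=True, return_train_score=True)
random_rf_class.fit(X_train, y_train)
predictions = random_rf_class.predict(X_test)
print("Model accuracy {}%".format(accuracy_score(y_test, predictions) * 100))
# Print the values tried for both hyperparameters
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])
print(random_rf_class.best_params_)
print(random_rf_class.best_score_)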
Returning a slice with 19 features here:
def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    (...)
    return final_x_test[:, :19]
removed the error and allowed the test to run.
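Slicing to the first 19 columns silences the error, but it is worth checking which columns survive the slice. The sketch below assumes, hypothetically, that the topological block has 2 columns per metric, laid out in the order of the metrics list. Under that assumption the slice keeps both bottleneck columns, both wasserstein columns and one landscape column, which would not line up with the five topological columns of the training data. Collapsing each metric's columns to a single value (here with a Euclidean norm, chosen only for illustration) is one alternative that keeps one column per metric.

```python
import numpy as np

metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
# ASSUMPTION: two columns per metric, laid out metric by metric.
topo_names = [f"{m}_{d}" for m in metrics for d in (0, 1)]

# The topological columns that final_x_test[:, :19] would keep (19 - 14 = 5).
kept_by_slice = topo_names[:19 - 14]
print(kept_by_slice)

def one_column_per_metric(topo, n_metrics=5):
    """Collapse (n, n_metrics * k) to (n, n_metrics) with a per-metric norm."""
    n = topo.shape[0]
    return np.linalg.norm(topo.reshape(n, n_metrics, -1), axis=2)

topo = np.tile(np.array([3.0, 4.0]), (4, 5))  # 4 rows, 10 toy columns
collapsed = one_column_per_metric(topo)
print(collapsed.shape, collapsed[0, 0])
```

Whether a norm matches how the training columns were originally computed is something to verify against the project before relying on it.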
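However, I still don't understand the gist of it.
I will award a bounty to anyone who explains to me the idea behind the test set in the context of this project, in the project notebook, which can be found here: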