ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input

Question

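I am trying to reproduce this GitHub project, on Topological Data Analysis (TDA), on my machine.

My steps:

Get the best parameters from the cross-validation output
Load my dataset (feature selection)
Extract the topological features from the dataset for prediction
Create a random forest classifier model based on the best parameters
Compute the probabilities on the test data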

Background:

Feature selection

In order to decide which attributes belong to which group, we created a correlation matrix. From this, we saw that there were two big groups in which player attributes were strongly correlated with each other. Therefore, we decided to split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones. Finally, since the goalkeeper has completely different statistics from the other players, we decided to take into account only the goalkeeper's overall rating. Below are the 24 features used for each player:

Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"

From this set of features, the next step was to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defensive ones.

Finally, for each team in a given match, we compute the mean and the standard deviation for the attack and the defense from these stats of the team's players, as well as the best attack and best defense.

In this way, a match is described by 14 features, 7 per team (GK overall rating, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match to a point in feature space according to the characteristics of the two teams.
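The aggregation described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper names `team_features` and `match_features` are hypothetical, the column order is arbitrary, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np

def team_features(attack, defense, gk_rating):
    """Summarise one team from per-player mean attack/defense scores.

    attack, defense: one value per outfield player (already averaged
    over that player's attack or defense attributes).
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        attack.max(), defense.max(),    # best attack, best defense
        attack.mean(), defense.mean(),  # mean attack, mean defense
        attack.std(), defense.std(),    # std attack, std defense (population std, assumed)
        float(gk_rating),               # goalkeeper overall rating
    ])

def match_features(home, away):
    """Concatenate the 7 home-team and 7 away-team features into 14."""
    return np.concatenate([team_features(*home), team_features(*away)])

# Toy teams with three outfield players each (made-up numbers).
home = ([70, 80, 90], [60, 50, 40], 75)
away = ([65, 75, 85], [55, 45, 35], 80)
features = match_features(home, away)
print(features.shape)
```

Seven summary values per team and two teams per match give the 14 base columns that appear at the start of the training array.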

Feature extraction

The aim of TDA is to catch the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for this kind of correlation.
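To make "the neighborhood of a data point" concrete, here is a small sketch that collects the k nearest points around one match in feature space. It is only an illustration under simple assumptions (Euclidean distance, a fixed k); the project's own SubSpaceExtraction step takes k_min, k_max and dist_percentage parameters and may work differently. The function name `k_nearest_neighbourhood` is hypothetical.

```python
import numpy as np

def k_nearest_neighbourhood(points, index, k):
    """Return the k points closest to points[index], excluding the point itself."""
    points = np.asarray(points, dtype=float)
    distances = np.linalg.norm(points - points[index], axis=1)
    order = np.argsort(distances, kind="stable")
    neighbours = order[order != index][:k]
    return points[neighbours]

# Four toy points in a 2-D feature space.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
cloud = k_nearest_neighbourhood(points, index=0, k=2)
print(cloud)
```

A point cloud like `cloud` is the kind of object from which a persistence diagram, and then the topological features, would be computed.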

Methods:

def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output

    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]

    return x_train_with_topo, y_train


def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST',y_test.shape)

    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)

    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        #
        print(range(0, len(x_test), shift) )
        if i+shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams
    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df

Working code:

best_pipeline_params, best_model_feat_params = get_best_params()

# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}

pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
            # SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
            #('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])

x_train, y_train = load_dataset()

# x_train.shape ->  (2565, 19)
# y_train.shape -> (2565,)

x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

# x_test.shape -> (380, 24)

rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')

But I get the error message:

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
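The error can be reproduced in isolation with random data, which shows that it depends only on the column counts and not on the TDA pipeline. The sketch below assumes scikit-learn is installed; the array sizes and the helper `try_predict` are made up for illustration, and the estimator named in the message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x_train = rng.normal(size=(60, 19))   # 19 features, as in the question
y_train = np.arange(60) % 3           # three classes: away win, draw, home win
x_test = rng.normal(size=(10, 24))    # 24 features, as in the question

model = RandomForestClassifier(n_estimators=5, random_state=0)
model.fit(x_train, y_train)

def try_predict(model, x):
    """Return the shape of predict_proba(x), or the error text if it fails."""
    try:
        return model.predict_proba(x).shape
    except ValueError as exc:
        return str(exc)

print(model.n_features_in_)                 # number of columns seen during fit
print(try_predict(model, x_test))           # fails: column count differs
print(try_predict(model, x_test[:, :19]))   # runs, but only because the widths now match
```

The last call running without an error says nothing about whether the 19 remaining columns mean the same thing as the 19 training columns.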

Loaded dataset (X_train):

Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64

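Note that the first 14 columns are the features describing the match, and the remaining 5 features (excluding the label) are the topological features, which have already been extracted.

The problem seems to arise when the code reaches extract_x_test_features() and extract_features_for_prediction(), which are supposed to compute the topological features and stack them with the dataset.

Since X_train already has the topological features, it adds 5 more, so I end up with 24 features.

I'm not sure, though. I'm just trying to wrap my head around this project... and around how predictions are made here.

How can I fix the mismatch using the code above?

Notes:

1 - x_train and y_test are not DataFrames but numpy.ndarray

2 - This issue is fully reproducible if you clone or download the project from the following link: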

Github Link

Answer 1

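Actually, the answer is already given in the question.

You mentioned # x_test.shape -> (380, 24) and # x_train.shape -> (2565, 19) in your question. It is clear that the shape of your test data does not match that of your train data: the train data has 19 features while the test data has 24 (they must contain the same number of features). So when you pass x_test to the model on this line, you get the error "X has 24 features, but DecisionTreeClassifier is expecting 19 features as input": get_probabilities(rf_model, x_test, team_ids).

Therefore, your test data must have 19 features, just like your train data.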

Answer 2

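In your x_train you have 19 features, while in X_test you have 24? Why is that?

To solve this, display both data frames (x_train and X_test) and try to find out why they have different features. In the end, both must have the same number of columns and the same features. Otherwise, you will get this error.

Maybe the dataset you imported is wrong.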
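A quick way to follow this advice with the NumPy arrays from the question is to compare the column counts before calling the model. The helper below is a hypothetical sketch; it assumes that the first 14 columns of both arrays are the match features, as the question states for the training set.

```python
import numpy as np

def describe_mismatch(x_train, x_test, n_base=14):
    """Summarise how many columns each array has beyond the base match features."""
    n_train, n_test = x_train.shape[1], x_test.shape[1]
    return {
        "train_features": n_train,
        "test_features": n_test,
        "train_topological": n_train - n_base,
        "test_topological": n_test - n_base,
        "match": n_train == n_test,
    }

# Zero-filled stand-ins with the shapes reported in the question.
report = describe_mismatch(np.zeros((2565, 19)), np.zeros((380, 24)))
print(report)
```

With the shapes from the question, the report shows 5 extra columns on the training side and 10 on the test side, which narrows the search to the step that appends the topological features.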

Answer 3

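Here is how to use RandomizedSearchCV to find the best parameters for your model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline2 = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=62, max_depth=16)),
])

# cycle through your pickle file parameter combinations here:
# ('auto' is no longer accepted for max_features in recent scikit-learn releases)
param_grid = {
    'n_estimators': list(range(30, 100)),
    'max_depth': list(range(5, 26)),
    'max_features': ['sqrt', 'log2'],
}

# Only the classifier step is searched here; the scaler is not applied.
random_rf_class = RandomizedSearchCV(
    estimator=pipeline2['clf'],
    param_distributions=param_grid,
    n_iter=10,
    scoring='accuracy', n_jobs=2, cv=10, refit=True, return_train_score=True)

random_rf_class.fit(X_train, y_train)

# X_test must have the same number of features as X_train
predictions = random_rf_class.predict(X_test)

print("Model accuracy {}%".format(accuracy_score(y_test, predictions) * 100))

# Print the values tried for both hyperparameters
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])

print(random_rf_class.best_params_)
print(random_rf_class.best_score_)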

Answer 4

Returning a slice with 19 features here:

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
   (...)
   return final_x_test[:, :19]

got rid of the error and let the test run.

However, I still don't understand the point of it.
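One possible reading of the numbers, which nothing in this thread confirms: if each of the five amplitude transforms returns one column per homology dimension, and there are two dimensions, the test set gains 10 topological columns and 14 + 10 = 24. In that case, slicing to the first 19 columns silences the error but keeps columns that do not line up with the training features. The NumPy-only sketch below mimics that situation with random stand-in arrays (no giotto-tda calls) and collapses each metric to a single column with a 2-norm. Whether the installed version of Amplitude behaves this way, and whether it offers an option to do the collapsing itself, is an assumption to check by printing the shape of amplitude.fit_transform(diagrams) and reading its documentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches, n_base, n_metrics, n_dims = 380, 14, 5, 2

x_test_base = rng.normal(size=(n_matches, n_base))

# Stand-ins for five amplitude outputs, each with one column
# per homology dimension (assumed, not taken from the project).
per_metric = [rng.random(size=(n_matches, n_dims)) for _ in range(n_metrics)]

# Concatenating them as they are gives 14 + 5 * 2 = 24 columns.
x_test_wide = np.concatenate([x_test_base] + per_metric, axis=1)

# Collapsing each metric to one value per match (2-norm across
# dimensions) gives 14 + 5 = 19 columns.
collapsed = [np.linalg.norm(a, axis=1, keepdims=True) for a in per_metric]
x_test_narrow = np.concatenate([x_test_base] + collapsed, axis=1)

print(x_test_wide.shape, x_test_narrow.shape)
```

If this is what happens, the column-wise fix belongs in extract_topological_features rather than in a slice at the end, so that column 14 means "bottleneck" in both the training and the test array.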

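I will award the bounty to anyone who explains to me the idea behind the test set in the context of this project, as used in the project notebook, which can be found here: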

Project Notebook


python

decision-tree

topological-sort

cross-validation