How to use Featuretools to create features from multiple columns in a single dataframe by column values?

I am trying to predict the outcome of soccer matches based on previous results. I am running Python 3.6 on Windows and using Featuretools 0.4.1.

Let's say I have the following dataframe representing the results history.

Original DataFrame

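Using the dataframe above, I would like to create the following dataframe, which will be fed as X to a machine learning algorithm. Note that the average goals for the home and away teams need to be calculated per team, regardless of the venue of past matches. Is there a way to create such a dataframe using Featuretools?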

Resulting Dataframe

An Excel file that simulates the transformation can be found here

This is a tricky feature, but a great use of a custom primitive in Featuretools.

The first step is to load the CSV of matches into a Featuretools entityset

import featuretools as ft
import pandas as pd

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)

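Then we define a custom transform primitive that calculates the average number of goals scored over the previous n games. It has parameters that control the number of past games and whether to calculate for the home team or the away team. Information about defining custom primitives is in our documentation here and here.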

from featuretools.variable_types import Numeric, Categorical
from featuretools.primitives import make_trans_primitive

def avg_goals_previous_n_games(home_team, away_team, home_goals, away_goals, which_team=None, n=1):
    # make dataframe so it's easier to work with
    df = pd.DataFrame({
        "home_team": home_team,
        "away_team": away_team,
        "home_goals": home_goals,
        "away_goals": away_goals
        })

    result = []
    for i, current_game in df.iterrows():
        # get the right team for this game
        team = current_game[which_team]

        # find all previous games that have been played
        prev_games = df.iloc[:i]

        # only get games the team participated in
        participated = prev_games[(prev_games["home_team"] == team) | (prev_games["away_team"] == team)]
        if participated.shape[0] < n:
            result.append(None)
            continue

        # get last n games
        last_n = participated.tail(n)

        # calculate goals per game
        goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
        goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

        # calculate mean across all home and away games
        mean = (goal_as_home + goal_as_away).mean()

        result.append(mean)

    return result

# custom function so the name of the feature prints out correctly
def make_name(self):
    return "%s_goal_last_%d" % (self.kwargs['which_team'], self.kwargs['n'])


AvgGoalPreviousNGames = make_trans_primitive(function=avg_goals_previous_n_games,
                                          input_types=[Categorical, Categorical, Numeric, Numeric],
                                          return_type=Numeric,
                                          cls_attributes={"generate_name": make_name, "uses_full_entity":True})
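
The masking step inside avg_goals_previous_n_games may be easier to follow in isolation. The following is a minimal sketch in plain pandas, using the three games played before match 6 as they appear in the feature matrix at the end of this answer; the variable names are chosen here for illustration only. Multiplying a boolean mask by a goals column keeps a game's goals only where the team played on that side, and zeroes them otherwise.

```python
import pandas as pd

# Games 3-5 from the example data (the three games before match 6)
last_n = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Chelsea"],
    "away_team": ["Chelsea", "Arsenal", "Arsenal"],
    "home_goals": [0, 1, 2],
    "away_goals": [3, 1, 0],
})
team = "Chelsea"

# True * goals keeps the value, False * goals gives 0
goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

# A team is on exactly one side per game, so the sum is its goals in that game
goals_per_game = goal_as_home + goal_as_away
mean_goals = goals_per_game.mean()
```

Here goals_per_game is [3, 1, 2] and mean_goals is 2.0, which agrees with home_team_goal_last_3 for match 6 in the output further down.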

Now we can define features using this primitive. In this case, we will have to do it manually.

input_vars = [es["matches"]["home_team"], es["matches"]["away_team"], es["matches"]["home_goals"], es["matches"]["away_goals"]]
home_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=1)
home_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=3)
home_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=5)
away_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=1)
away_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=3)
away_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=5)

features = [home_team_last1, home_team_last3, home_team_last5,
            away_team_last1, away_team_last3, away_team_last5]

Finally, we can calculate the feature matrix

fm = ft.calculate_feature_matrix(entityset=es, features=features)

This returns

          home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5
match_id                                                                                                                                          
1                           NaN                    NaN                    NaN                    NaN                    NaN                    NaN
2                           2.0                    NaN                    NaN                    0.0                    NaN                    NaN
3                           1.0                    NaN                    NaN                    0.0                    NaN                    NaN
4                           3.0               1.000000                    NaN                    0.0               1.000000                    NaN
5                           1.0               1.333333                    NaN                    1.0               0.666667                    NaN
6                           2.0               2.000000                    1.2                    0.0               0.333333                    0.8
7                           1.0               0.666667                    0.6                    2.0               1.666667                    1.6
8                           2.0               1.000000                    0.8                    2.0               2.000000                    2.0
9                           0.0               1.000000                    0.8                    1.0               1.666667                    1.6
10                          3.0               2.000000                    2.0                    1.0               1.000000                    0.8
11                          3.0               2.333333                    2.2                    1.0               0.666667                    1.0
12                          2.0               2.666667                    2.2                    2.0               1.333333                    1.2
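
As a cross-check that does not depend on Featuretools, the same numbers can be reproduced with plain pandas. This is only a sketch under a few assumptions: the match data is retyped from the feature matrix shown later in this answer, match_id is assumed to be in chronological order, and avg_goals_last_n is a hypothetical helper name, not part of any library. It reshapes the matches to one row per team per game, then takes a shifted rolling mean within each team so that the current game is excluded.

```python
import pandas as pd

matches = pd.DataFrame({
    "match_id": list(range(1, 13)),
    "home_team": ["Arsenal"] * 3 + ["Chelsea"] * 3 + ["Arsenal"] * 3 + ["Chelsea"] * 3,
    "away_team": ["Chelsea"] * 3 + ["Arsenal"] * 3 + ["Chelsea"] * 3 + ["Arsenal"] * 3,
    "home_goals": [2, 1, 0, 1, 2, 2, 2, 0, 1, 3, 2, 4],
    "away_goals": [0, 0, 3, 1, 0, 1, 2, 1, 3, 1, 2, 1],
})

def avg_goals_last_n(matches, n):
    """Hypothetical helper: average goals per team over its previous n games."""
    cols = ["match_id", "team", "goals"]
    home = matches.rename(columns={"home_team": "team", "home_goals": "goals"})[cols]
    away = matches.rename(columns={"away_team": "team", "away_goals": "goals"})[cols]
    # One row per (match, team), remembering which side the team played on
    long = pd.concat([home.assign(side="home"), away.assign(side="away")],
                     ignore_index=True)
    long = long.sort_values("match_id", kind="mergesort")
    # shift(1) drops the current game; rolling(n) yields NaN until n games exist
    long["avg"] = long.groupby("team")["goals"].transform(
        lambda s: s.shift(1).rolling(n).mean())
    wide = long.pivot(index="match_id", columns="side", values="avg")
    return wide.rename(columns={"home": "home_team_goal_last_%d" % n,
                                "away": "away_team_goal_last_%d" % n})

last3 = avg_goals_last_n(matches, 3)
last5 = avg_goals_last_n(matches, 5)
```

The vectorized version avoids the row-by-row loop, at the cost of being a separate code path from the primitive, so it is better suited to validating results than to replacing the feature definition.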

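Finally, we can also use these manually defined features as input to automated feature engineering with Deep Feature Synthesis, which is explained here. By passing the manually defined features as seed_features, ft.dfs will automatically stack on top of them.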

fm, feature_defs = ft.dfs(entityset=es, 
                          target_entity="matches",
                          seed_features=features, 
                          agg_primitives=[], 
                          trans_primitives=["day", "month", "year", "weekday", "percentile"])

feature_defs

[<Feature: home_team>,
 <Feature: away_team>,
 <Feature: home_goals>,
 <Feature: away_goals>,
 <Feature: label>,
 <Feature: home_team_goal_last_1>,
 <Feature: home_team_goal_last_3>,
 <Feature: home_team_goal_last_5>,
 <Feature: away_team_goal_last_1>,
 <Feature: away_team_goal_last_3>,
 <Feature: away_team_goal_last_5>,
 <Feature: DAY(match_date)>,
 <Feature: MONTH(match_date)>,
 <Feature: YEAR(match_date)>,
 <Feature: WEEKDAY(match_date)>,
 <Feature: PERCENTILE(home_goals)>,
 <Feature: PERCENTILE(away_goals)>,
 <Feature: PERCENTILE(home_team_goal_last_1)>,
 <Feature: PERCENTILE(home_team_goal_last_3)>,
 <Feature: PERCENTILE(home_team_goal_last_5)>,
 <Feature: PERCENTILE(away_team_goal_last_1)>,
 <Feature: PERCENTILE(away_team_goal_last_3)>,
 <Feature: PERCENTILE(away_team_goal_last_5)>]

The feature matrix is

         home_team away_team  home_goals  away_goals label  home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5  DAY(match_date)  MONTH(match_date)  YEAR(match_date)  WEEKDAY(match_date)  PERCENTILE(home_goals)  PERCENTILE(away_goals)  PERCENTILE(home_team_goal_last_1)  PERCENTILE(home_team_goal_last_3)  PERCENTILE(home_team_goal_last_5)  PERCENTILE(away_team_goal_last_1)  PERCENTILE(away_team_goal_last_3)  PERCENTILE(away_team_goal_last_5)
match_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
1          Arsenal   Chelsea           2           0     1                    NaN                    NaN                    NaN                    NaN                    NaN                    NaN                1                  1              2014                    2                0.666667                0.166667                                NaN                                NaN                                NaN                                NaN                                NaN                                NaN
2          Arsenal   Chelsea           1           0     1                    2.0                    NaN                    NaN                    0.0                    NaN                    NaN                2                  1              2014                    3                0.333333                0.166667                           0.590909                                NaN                                NaN                           0.227273                                NaN                                NaN
3          Arsenal   Chelsea           0           3     2                    1.0                    NaN                    NaN                    0.0                    NaN                    NaN                3                  1              2014                    4                0.125000                0.958333                           0.272727                                NaN                                NaN                           0.227273                                NaN                                NaN
4          Chelsea   Arsenal           1           1     X                    3.0               1.000000                    NaN                    0.0               1.000000                    NaN                4                  1              2014                    5                0.333333                0.500000                           0.909091                           0.333333                                NaN                           0.227273                           0.500000                                NaN
5          Chelsea   Arsenal           2           0     1                    1.0               1.333333                    NaN                    1.0               0.666667                    NaN                5                  1              2014                    6                0.666667                0.166667                           0.272727                           0.555556                                NaN                           0.590909                           0.277778                                NaN
6          Chelsea   Arsenal           2           1     1                    2.0               2.000000                    1.2                    0.0               0.333333                    0.8                6                  1              2014                    0                0.666667                0.500000                           0.590909                           0.722222                           0.571429                           0.227273                           0.111111                           0.214286
7          Arsenal   Chelsea           2           2     X                    1.0               0.666667                    0.6                    2.0               1.666667                    1.6                7                  1              2014                    1                0.666667                0.791667                           0.272727                           0.111111                           0.142857                           0.909091                           0.833333                           0.785714
8          Arsenal   Chelsea           0           1     2                    2.0               1.000000                    0.8                    2.0               2.000000                    2.0                8                  1              2014                    2                0.125000                0.500000                           0.590909                           0.333333                           0.357143                           0.909091                           1.000000                           1.000000
9          Arsenal   Chelsea           1           3     2                    0.0               1.000000                    0.8                    1.0               1.666667                    1.6                9                  1              2014                    3                0.333333                0.958333                           0.090909                           0.333333                           0.357143                           0.590909                           0.833333                           0.785714
10         Chelsea   Arsenal           3           1     1                    3.0               2.000000                    2.0                    1.0               1.000000                    0.8               10                  1              2014                    4                0.916667                0.500000                           0.909091                           0.722222                           0.714286                           0.590909                           0.500000                           0.214286
11         Chelsea   Arsenal           2           2     X                    3.0               2.333333                    2.2                    1.0               0.666667                    1.0               11                  1              2014                    5                0.666667                0.791667                           0.909091                           0.888889                           0.928571                           0.590909                           0.277778                           0.428571
12         Chelsea   Arsenal           4           1     1                    2.0               2.666667                    2.2                    2.0               1.333333                    1.2               12                  1              2014                    6                1.000000                0.500000                           0.590909                           1.000000                           0.928571                           0.909091                           0.666667                           0.571429
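
To help read the PERCENTILE columns: the values above are consistent with a percentile rank computed over the whole column, with ties sharing their average rank. The sketch below reproduces PERCENTILE(home_goals) with plain pandas; it is an illustration of that interpretation using the home_goals values from the table, not a statement about how Featuretools implements the primitive internally.

```python
import pandas as pd

# home_goals for matches 1-12, copied from the feature matrix above
home_goals = pd.Series([2, 1, 0, 1, 2, 2, 2, 0, 1, 3, 2, 4],
                       index=pd.RangeIndex(1, 13, name="match_id"))

# rank(pct=True) divides each value's (average) rank by the number of rows
percentile = home_goals.rank(pct=True)
```

For example, the two games with 0 home goals share ranks 1 and 2, giving 1.5 / 12 = 0.125, and the five games with 2 home goals share ranks 6 to 10, giving 8 / 12 ≈ 0.666667.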