How do I create features in featuretools for rows with the same id and a time index?

Question

I have a dataframe like this:

import pandas as pd
import featuretools as ft

data = {'Customer': ['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3'],
        'NumOfItems': [3, 2, 4, 5, 5, 6, 10, 6, 14],
        'PurchaseTime': ["2014-01-01", "2014-01-02", "2014-01-03", "2014-01-01", "2014-01-02", "2014-01-03", "2014-01-01", "2014-01-02", "2014-01-03"]
        }
df = pd.DataFrame(data)
df

I want to create a feature such as the maximum per customer so far:

'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]  # the output I want
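
For reference, the target column can be reproduced in plain pandas, which gives something to compare any featuretools output against. This is only a sketch of the desired result, not a featuretools solution; it assumes the rows should be ordered by PurchaseTime before taking the running maximum.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["C1", "C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"],
    "NumOfItems": [3, 2, 4, 5, 5, 6, 10, 6, 14],
    "PurchaseTime": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"] * 3),
})

# Order rows by time so that "so far" is well defined (stable sort keeps ties in place).
ordered = df.sort_values("PurchaseTime", kind="stable")

# Running maximum of NumOfItems within each customer.
running_max = ordered.groupby("Customer")["NumOfItems"].cummax()

# Assignment aligns on the original row index, so df keeps its original order.
df["MaxPerID(NumOfItems)"] = running_max
print(df)
```

Because the assignment aligns on the index, the new column lines up with the original rows even though the running maximum was computed on the time-sorted copy.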

So I set up the EntitySet and normalized it:

es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customer",
                              dataframe=df,
                              index='index',
                              time_index="PurchaseTime",
                              make_index=True)

es = es.normalize_entity(base_entity_id="customer",
                         new_entity_id="sessions",
                         index="Customer")

But creating the feature matrix did not produce the result I wanted.

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="customer",
                                  agg_primitives=["max"],
                                  max_depth=3)
feature_matrix

sessions.MAX(customer.NumOfItems)  
index                                                                         
0                                      4                                    
3                                      6                                    
6                                     14                                    
1                                      4                                    
4                                      6                                    
7                                     14                                    
2                                      4                                    
5                                      6                                    
8                                     14

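The returned feature is the maximum over all customers per day (sorted by time), but if I run the same code without time_index="PurchaseTime", the result is the maximum for the specific customer only: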

       sessions.MAX(customer.NumOfItems)
index                                                                       
0                    4   
1                    4   
2                    4   
3                    6   
4                    6   
5                    6   
6                   14   
7                   14   
8                   14

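I want a combination of the two: the maximum for a specific customer up to that point in time. Is this possible? I tried working with es['customer']['Customer'].interesting_values = ['C1', 'C2', 'C3'], but it got me nowhere. I also tried modifying the new normalized entity and writing my own primitive for this.

I am new to featuretools, so any help would be greatly appreciated.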

Answer 1

Thanks for the question. You can get the expected output by using a group-by transform primitive.

fm, fd = ft.dfs(
    entityset=es,
    target_entity="customer",
    groupby_trans_primitives=['cum_max'],
)

You should get the cumulative maximum of the number of items for each customer.

column = 'CUM_MAX(NumOfItems) by Customer'
actual = fm[[column]].sort_values(column)
expected = {'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]}
actual.assign(**expected)

       CUM_MAX(NumOfItems) by Customer  MaxPerID(NumOfItems)
index
0                                  3.0                     3
1                                  3.0                     3
2                                  4.0                     4
3                                  5.0                     5
4                                  5.0                     5
5                                  6.0                     6
6                                 10.0                    10
7                                 10.0                    10
8                                 14.0                    14
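
To make the difference between the two kinds of feature concrete, here is a small pure-Python sketch. It contrasts an aggregation-style maximum (one value per customer, repeated on every row, which is what the question's MAX feature shows) with a group-by cumulative maximum (a running value per row). The function names max_by_group and cumulative_max_by_group are hypothetical and only illustrate the idea; they are not featuretools internals.

```python
def max_by_group(keys, values):
    """Aggregation style: one maximum per group, repeated on each of its rows."""
    best = {}
    for key, value in zip(keys, values):
        best[key] = max(best.get(key, value), value)
    return [best[key] for key in keys]


def cumulative_max_by_group(keys, times, values):
    """Transform style: running maximum per group, visiting rows in time order."""
    # Stable sort of row positions by time; ISO date strings sort chronologically.
    order = sorted(range(len(keys)), key=lambda i: times[i])
    running = {}
    result = [None] * len(keys)
    for i in order:
        key = keys[i]
        running[key] = max(running.get(key, values[i]), values[i])
        result[i] = running[key]  # written back at the row's original position
    return result


customers = ["C1", "C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"]
items = [3, 2, 4, 5, 5, 6, 10, 6, 14]
times = ["2014-01-01", "2014-01-02", "2014-01-03"] * 3

per_customer_max = max_by_group(customers, items)
running_max = cumulative_max_by_group(customers, times, items)
print(per_customer_max)
print(running_max)
```

The first list has a single value per customer, while the second changes row by row, which is why a transform-style primitive grouped by Customer fits this question better than an aggregation.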

python

feature-extraction

data-science

feature-engineering

featuretools