Is it possible to calculate a feature matrix only for test data?

Question

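I have more than 100,000 rows of timestamped training data and want to compute a feature matrix for new test data of only 10 rows. Some features in the test data will end up aggregating some of the training data. The implementation needs to be fast, since this is one step in a real-time inference pipeline.

I can think of two ways to implement this:

1. Concatenate the training and test entity sets, run DFS, then keep only the last 10 rows and discard the rest. This is very time-consuming. Is there a way to compute features for a subset of an entity set while still using data from the whole entity set?
2. Use the steps outlined in the Calculating Feature Matrix for New Data section of the Featuretools Deployment page. However, as shown below, this does not seem to work.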

Create the all/train/test entity sets:

import featuretools as ft

data = ft.demo.load_mock_customer(n_customers=3, n_sessions=15)
    
df_sessions = data['sessions']
    
# Create all/train/test entity sets.
all_es = ft.EntitySet(id='sessions')
train_es = ft.EntitySet(id='sessions')
test_es = ft.EntitySet(id='sessions')
    
all_es = all_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions,  # all sessions
    index='session_id',
    time_index='session_start',
)
    
train_es = train_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[:10],  # first 10 sessions
    index='session_id',
    time_index='session_start',
)
    
test_es = test_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[10:],  # last 5 sessions
    index='session_id',
    time_index='session_start',
)
    
# Normalise customer entities so we can group by customers.
all_es = all_es.normalize_entity(base_entity_id='sessions',
                                 new_entity_id='customers',
                                 index='customer_id')

train_es = train_es.normalize_entity(base_entity_id='sessions',
                                     new_entity_id='customers',
                                     index='customer_id')

test_es = test_es.normalize_entity(base_entity_id='sessions',
                                   new_entity_id='customers',
                                   index='customer_id')

Set cutoff_time, since we are working with timestamped data:

cutoff_time = (df_sessions
               .filter(['session_id', 'session_start'])
               .rename(columns={'session_id': 'instance_id',
                                'session_start': 'time'}))

Compute the feature matrix for all of the data:

feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time,
                                       target_entity='sessions')
    
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))

session_id	customer_id	customers.COUNT(sessions)
1	3	1
2	3	2
3	1	1
4	2	1
5	2	2
6	2	3
7	2	4
8	1	2
9	2	5
10	1	3
11	1	4
12	2	6
13	3	3
14	1	5
15	3	4

Compute the feature matrix for the training data:

feature_matrix, features_defs = ft.dfs(entityset=train_es,
                                       cutoff_time=cutoff_time.iloc[:10],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))

session_id	customer_id	customers.COUNT(sessions)
1	3	1
2	3	2
3	1	1
4	2	1
5	2	2
6	2	3
7	2	4
8	1	2
9	2	5
10	1	3

Compute the feature matrix for the test data (using the approach shown under "Feature Matrix for New Data" on the Featuretools Deployment page):

feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=test_es,
                                             cutoff_time=cutoff_time.iloc[10:])
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))

session_id	customer_id	customers.COUNT(sessions)
11	1	1
12	2	1
13	3	1
14	1	2
15	3	2

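As you can see, the feature matrix generated from train_es matches the first 10 rows of the feature matrix generated from all_es. However, the feature matrix generated from test_es does not match the corresponding rows of the feature matrix generated from all_es.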

Answer 1

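You can control which instances features are generated for with the cutoff_time dataframe (or with the instance_ids argument of DFS if the cutoff time is a single datetime). Featuretools only generates features for instances whose IDs appear in the cutoff time dataframe and ignores all others: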

feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time[10:],
                                       target_entity='sessions')
    
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))

session_id	customer_id	customers.COUNT(sessions)
11	1	4
12	2	6
13	3	3
14	1	5
15	3	4

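The approach in "Feature Matrix for New Data" is useful when you want to calculate the same features on entirely new data. All of the same features are created, but no data is shared between the entity sets. That does not work in this case, because the goal is to use all of the data while generating features only for certain instances.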
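
To make the difference concrete, here is a minimal plain-Python sketch of what customers.COUNT(sessions) amounts to under a per-row cutoff time. It is not Featuretools code. The customer IDs are copied from the output shown in the question; the timestamps and the helper count_sessions are assumptions made up for illustration (one session per minute, in session_id order).

```python
from datetime import datetime, timedelta

# Customer IDs for sessions 1..15, copied from the question's output.
CUSTOMER_IDS = [3, 3, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 3, 1, 3]

# Assumed timestamps: one session per minute, ordered by session_id.
START = datetime(2014, 1, 1)
sessions = [
    {
        "session_id": i + 1,
        "customer_id": cid,
        "session_start": START + timedelta(minutes=i),
    }
    for i, cid in enumerate(CUSTOMER_IDS)
]


def count_sessions(history, targets):
    """Hypothetical helper: for each target session, count the sessions in
    `history` that belong to the same customer and started at or before
    the target's own start time (its cutoff time)."""
    counts = {}
    for target in targets:
        counts[target["session_id"]] = sum(
            1
            for row in history
            if row["customer_id"] == target["customer_id"]
            and row["session_start"] <= target["session_start"]
        )
    return counts


test_rows = sessions[10:]  # sessions 11..15

# Full history, features only for the test rows
# (analogous to all_es with cutoff_time[10:]).
with_history = count_sessions(sessions, test_rows)

# Test rows treated as a separate data set (analogous to test_es).
isolated = count_sessions(test_rows, test_rows)

print(with_history)  # {11: 4, 12: 6, 13: 3, 14: 5, 15: 4}
print(isolated)      # {11: 1, 12: 1, 13: 1, 14: 2, 15: 2}
```

The first result reproduces the counts from all_es for sessions 11 to 15, and the second reproduces the counts from test_es, where each customer's history restarts because the earlier sessions are not in the entity set.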

