How can I get automatic features with dfs, using Featuretools, when I have only one dataframe?

Question

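I am trying to figure out how Featuretools works, and I am testing it on the Housing Prices dataset from Kaggle. Because the dataset is large, I will work with only a subset of it here.

The dataframe is: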

import pandas as pd
import featuretools as ft

train = pd.DataFrame({
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
})

I set up the dataframes argument:

dataframes = {'train': (train, 'Id')}

Then I call the dfs method:

train_feature_matrix, train_feature_names = ft.dfs(dataframes=dataframes, target_dataframe_name='train', max_depth=10, agg_primitives=["mean", "sum", "mode"])

I get the following warning:

UnusedPrimitiveWarning: Some specified primitives were not used during DFS: agg_primitives: ['mean', 'mode', 'sum'] This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible columns for the primitive were found in the data. If the DFS call contained multiple instances of a primitive in the list above, none of them were used. warnings.warn(warning_msg, UnusedPrimitiveWarning)

And train_feature_matrix is exactly the same as the original train dataframe.

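At first I thought this was because my dataframe is small and nothing useful could be extracted from it. But I get the same behavior with the entire dataframe (80 columns and 1460 rows).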

Every example I have seen on the Featuretools page uses two or more dataframes, but I have only one.

Can you shed some light on this? What am I doing wrong?

Answer 1

Aggregation primitives cannot create features on an EntitySet that contains a single DataFrame.

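This is because the aggregations they perform happen over a one-to-many relationship, which exists when there is a parent-child relationship between DataFrames in your EntitySet. The Featuretools guide on primitives has a section that explains the difference. For your data, this could look like a child DataFrame with a non-unique house_id column. Running dfs on the train DataFrame would then aggregate the requested information for each Id, using every row in the child DataFrame where that Id appears.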

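To automatically generate features with a single DataFrame, you should use transform features. The list of available Transform Primitives is in the Featuretools documentation.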

python

feature-extraction

pandas

featuretools