Can you use the isolation forest algorithm on large sample sizes?

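I have been using the scikit-learn implementation of isolation forest, sklearn.ensemble.IsolationForest, to detect anomalies in my datasets, which range from 100 rows to millions of rows. It seems to work well, and I have overridden max_samples with a very large integer to handle some of my larger datasets (essentially disabling sub-sampling). However, I noticed that the original paper states that larger sample sizes create a risk of swamping and masking.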

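Is it okay to use isolation forest on large sample sizes if it seems to work well? I tried training with a smaller max_samples, but testing then flagged too many anomalies. My data is really starting to grow, and I am wondering whether a different anomaly detection algorithm would be better for such large sample sizes.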

Quoting the original paper:

The isolation characteristic of iTrees enables them to build partial models and exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection; it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.

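From your question, I have the feeling that you are confusing the size of the dataset with the size of the sample drawn from it to build each iTree. Isolation forest can handle very large datasets. It actually works better when each tree is built from a small sub-sample of the data.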

The original paper discusses this in chapter 3:

The data set has two anomaly clusters located close to one large cluster of normal points at the centre. There are interfering normal points surrounding the anomaly clusters, and the anomaly clusters are denser than normal points in this sample of 4096 instances. Figure 4(b) shows a sub-sample of 128 instances of the original data. The anomalies clusters are clearly identifiable in the sub-sample. Those normal instances surrounding the two anomaly clusters have been cleared out, and the size of anomaly clusters becomes smaller which makes them easier to identify. When using the entire sample, iForest reports an AUC of 0.67. When using a sub-sampling size of 128, iForest achieves an AUC of 0.91.
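To make the distinction concrete, here is a minimal sketch of fitting on a large dataset while keeping the per-tree sample small. The synthetic data, the max_samples value of 256, and the 0.1% cut-off are assumptions chosen for illustration, not values taken from your data or from the paper. Thresholding on a quantile of score_samples (or setting the contamination parameter) is one way to control how many points get flagged, which may help if a small max_samples appeared to produce too many anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic stand-in for a large dataset (assumption): 100,000 normal
# points plus 200 scattered outliers appended at the end.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(100_000, 2))
X_outliers = rng.uniform(low=-8.0, high=8.0, size=(200, 2))
X = np.vstack([X_normal, X_outliers])

# Each tree is built from only 256 rows, however large X is.
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)

# score_samples: the lower the score, the more anomalous the point.
scores = forest.score_samples(X)

# Flag the lowest-scoring 0.1% instead of relying on the default threshold.
# The 0.001 quantile is an illustrative choice, not a recommendation.
threshold = np.quantile(scores, 0.001)
flagged = scores <= threshold
print("flagged points:", int(flagged.sum()))
```

The whole dataset is still scored; only the construction of each tree is limited to the sub-sample.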

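Isolation forest is not a perfect algorithm, and its parameters need to be tuned for your specific data. It may even perform poorly on some datasets. If you want to consider other methods, Local Outlier Factor is also included in sklearn. You can also combine several methods (an ensemble).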
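As a rough sketch of what combining methods could look like, the example below scores the same data with IsolationForest and LocalOutlierFactor and averages the ranks of the two scores. Rank averaging is just one simple way to combine them and is my assumption here, as are the synthetic data and all parameter values.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)

# Synthetic data (assumption): 2,000 normal points followed by 20 outliers.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(2000, 2)),
    rng.uniform(-8.0, 8.0, size=(20, 2)),
])

# Isolation forest: lower score means more anomalous.
iso = IsolationForest(max_samples=256, random_state=42).fit(X)
iso_scores = iso.score_samples(X)

# Local Outlier Factor: negative_outlier_factor_ is also lower for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_scores = lof.negative_outlier_factor_

# The two scores are on different scales, so average their ranks instead.
# A low combined rank means both detectors consider the point unusual.
combined = (rankdata(iso_scores) + rankdata(lof_scores)) / 2.0

# Indices of the 20 most suspicious points under the combined ranking.
suspects = np.argsort(combined)[:20]
print(suspects)
```

Note that LocalOutlierFactor computes nearest neighbours over the whole dataset, so it can become expensive on millions of rows.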

Here you can find a nice comparison of different methods.