Is it mandatory to set the contamination value for Isolation Forest in Python?

Question

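I am going to build a model to identify anomalies in my dataset. I did a lot of research and concluded that Isolation Forest is the best fit. My dataset has no labels (that is, it contains only explanatory variables). However, I don't know how to set the contamination parameter of Isolation Forest: most articles that explain it already have an output variable (rows labeled as anomalous), use it to compute the outlier ratio, and then pass that ratio as the contamination value.

Is it mandatory to set it? The default contamination value is 0.1. Can I ignore it? If I don't assign a value, will it affect the model's results?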

from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1, n_estimators=1000)

Answer 1

No, setting the contamination value is not mandatory. It defaults to 'auto'.

contamination : 'auto' or float, default='auto'
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples.

Reference in documentation
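As a quick check of the default, here is a minimal sketch on synthetic, unlabeled data (the random data and variable names are made up for illustration). It assumes a recent scikit-learn release (0.22 or later), where the default is 'auto'; older releases used 0.1, which is likely where the value in the question comes from.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic, unlabeled data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# No contamination argument is passed, so the library default is used.
default_model = IsolationForest(random_state=0).fit(X)

print(default_model.contamination)  # 'auto' in scikit-learn 0.22+
print(default_model.offset_)        # -0.5 when contamination='auto'
```

With 'auto', the decision threshold is a fixed offset on the scores rather than a quantile of the training scores, so the share of points flagged as outliers is not fixed in advance.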

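So you can omit it, but it can (and usually will) affect the model's results, because the predict method uses a threshold that is determined by the contamination value.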

The predict method makes use of a threshold on the raw scoring function computed by the estimator. This scoring function is accessible through the score_samples method, while the threshold can be controlled by the contamination parameter.

Reference in documentation
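To illustrate the relationship described in that passage, the following sketch fits two forests that differ only in contamination. The data is synthetic and the values 0.01 and 0.10 are arbitrary choices for the example, not recommendations. It suggests that the raw scores stay the same while the cut-off, and therefore the output of predict, changes.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic, unlabeled data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Same trees (same random_state), different contamination values.
low = IsolationForest(contamination=0.01, random_state=0).fit(X)
high = IsolationForest(contamination=0.10, random_state=0).fit(X)

# Raw anomaly scores do not depend on contamination.
scores_low = low.score_samples(X)
scores_high = high.score_samples(X)

# predict() labels a sample -1 (outlier) when score_samples - offset_ < 0.
pred_low = low.predict(X)
pred_high = high.predict(X)
manual_high = np.where(scores_high - high.offset_ < 0, -1, 1)

frac_low = np.mean(pred_low == -1)
frac_high = np.mean(pred_high == -1)
print(frac_low, frac_high)  # roughly the requested contamination rates
```

Without labels, one option is to skip predict entirely and rank samples by score_samples, then choose a cut-off that suits the application (for example, reviewing the lowest-scoring rows by hand).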

python

outliers

scikit-learn

anomaly-detection