Isolation Forest in sklearn for a 1D array or list, and how to tune hyperparameters

Question

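Is there a way to use sklearn's Isolation Forest on a 1D array or list? All the examples I have come across are for data with two or more dimensions.

So far I have developed a model with three features; a sample code snippet is shown below: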

from sklearn.ensemble import IsolationForest

# dataframe of three columns
df_data = datafr[['col_A', 'col_B', 'col_C']]
# split the rows into a training set and a test set
w_train = df_data[:700]
w_test = df_data[700:-2]

# fit the model
clf = IsolationForest(max_samples='auto')
clf.fit(w_train)

# predict on the test set (1 = inlier, -1 = outlier)
y_pred_test = clf.predict(w_test)

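The reference I mainly relied on: IsolationForest example | scikit-learn

df_data is a data frame with three columns. What I am actually looking for is outliers in 1D or list data.

My other question is how to tune the Isolation Forest model. One approach is to adjust the contamination value to reduce false positives, but how should the other parameters be used, such as n_estimators, max_samples, max_features, verbose, etc.?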

Answer 1

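It doesn't make sense to apply Isolation Forest to a 1D array or list, because in that case it would simply be a one-to-one mapping from feature to target.

You can read the official documentation to get a better idea of what the different parameters do.

contamination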
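On the mechanical side of the question: scikit-learn estimators expect a 2D input of shape (n_samples, n_features), so a plain list has to be reshaped into a single column before it can be passed to fit. The sketch below is only an illustration; the values and the contamination of 0.1 are made up, and whether the resulting flags are meaningful for your data is a separate question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical 1D data: values clustered near 10 plus one extreme value.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 55.0, 10.1]

# scikit-learn expects shape (n_samples, n_features), so use a single column.
X = np.asarray(values, dtype=float).reshape(-1, 1)

clf = IsolationForest(contamination=0.1, random_state=0)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier
print(labels)
```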

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.

Try experimenting with different values in the range (0, 0.5] and see which one gives the best results.
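To see what this threshold does in practice, the sketch below fits the same synthetic data with a few contamination values and counts how many training points get flagged. The data and the candidate values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical data: 200 points with 3 features (like col_A, col_B, col_C).
X = rng.normal(size=(200, 3))

flagged = {}
for contamination in (0.01, 0.05, 0.1, 0.2):
    clf = IsolationForest(contamination=contamination, random_state=0).fit(X)
    # predict returns -1 for outliers and 1 for inliers
    flagged[contamination] = int((clf.predict(X) == -1).sum())
    print(contamination, flagged[contamination])
```

A higher contamination flags a larger share of the points, so raising it tends to increase false positives rather than reduce them.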

max_features

The number of features to draw from X to train each base estimator.

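Try integer values no larger than the number of features (1, 2 or 3 for the three columns above), or a float in (0, 1] to draw that fraction of the features, and validate the choice on your final test data.

n_estimators: try several values such as 10, 20, 50, etc. and see which works best.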
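As a small illustration of what max_features controls, the sketch below fits a forest on hypothetical three-feature data and checks how many features the first tree was given. The data is made up; estimators_features_ is the fitted attribute that holds the drawn feature indices for each tree.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))  # hypothetical data with three features

drawn = {}
for max_features in (1, 2, 3):
    clf = IsolationForest(max_features=max_features, random_state=0).fit(X)
    # estimators_features_ lists the feature indices each tree was given
    drawn[max_features] = len(clf.estimators_features_[0])
    print(max_features, drawn[max_features])
```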

You can also use GridSearchCV to automate this parameter search: give it a set of candidate values and see which combination works best.

Try this:

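from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# f1_score compares predict() output (1 = inlier, -1 = outlier) with the labels,
# so the labels passed to fit must use the same 1 / -1 encoding
my_scoring_func = make_scorer(f1_score)
parameters = {
    'n_estimators': [10, 30, 50, 80],
    'max_features': [0.1, 0.2, 0.3, 0.4],
    'contamination': [0.1, 0.2, 0.3],
}
iso_for = IsolationForest(max_samples='auto')
clf = GridSearchCV(iso_for, parameters, scoring=my_scoring_func)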

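Then fit the data with clf. Note, however, that GridSearchCV needs both X and y (i.e., training data and labels) for the fit method.

Note: if you want to use GridSearchCV with Isolation Forest, you can read this blog post for further reference; otherwise, you can try different values manually and plot graphs to compare the results.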
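For a runnable end-to-end version, here is a minimal sketch on synthetic labeled data. Everything about the data is an assumption made for illustration: 285 inliers, 15 scattered outliers, and labels encoded as 1 / -1 to match predict. Two choices differ from the snippet above and are my own: pos_label=-1 makes the F1 score measure the outlier class, and an explicit StratifiedKFold keeps some outliers in every fold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.RandomState(42)
# Hypothetical labeled data: inliers near the origin, outliers spread widely.
X_in = rng.normal(loc=0.0, scale=1.0, size=(285, 3))
X_out = rng.uniform(low=-10.0, high=10.0, size=(15, 3))
X = np.vstack([X_in, X_out])
y = np.r_[np.ones(285, dtype=int), -np.ones(15, dtype=int)]  # 1 = inlier, -1 = outlier

# Score the outlier class (-1) rather than the default positive label 1.
scorer = make_scorer(f1_score, pos_label=-1)
param_grid = {"n_estimators": [50, 100], "contamination": [0.05, 0.1]}
# Stratify so that every fold contains a few outliers.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    IsolationForest(random_state=0), param_grid, scoring=scorer, cv=cv
)
search.fit(X, y)  # y is ignored by IsolationForest.fit but used by the scorer
print(search.best_params_, search.best_score_)
```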
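The manual route could look roughly like the sketch below: hold out part of the labeled data, sweep one parameter, and collect a score per value so the pairs can be plotted afterwards. The synthetic data, the 1 / -1 label encoding, and the choice of ROC AUC on the anomaly scores (a threshold-free metric, so contamination does not affect it) are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
# Hypothetical labeled data: 1 = inlier, -1 = outlier.
X = np.vstack([rng.normal(size=(380, 3)), rng.uniform(-10, 10, size=(20, 3))])
y = np.r_[np.ones(380, dtype=int), -np.ones(20, dtype=int)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = []  # (n_estimators, AUC) pairs, ready to be plotted
for n_estimators in (10, 20, 50, 100):
    clf = IsolationForest(n_estimators=n_estimators, random_state=0).fit(X_train)
    # score_samples is lower for anomalies, so negate it to rank outliers first
    anomaly_score = -clf.score_samples(X_test)
    auc = roc_auc_score(y_test == -1, anomaly_score)
    results.append((n_estimators, auc))
    print(n_estimators, round(auc, 3))
```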


python

algorithm

machine-learning

scikit-learn

anomaly-detection