How to create a model for anomaly detection in H2O-R

I'm trying to run anomaly detection with H2O in R (h2o_3.14.0.2).

First, I tried it with my main deep learning model and got this error:

water.exceptions.H2OIllegalArgumentException
 [1] "water.exceptions.H2OIllegalArgumentException: Only for AutoEncoder Deep Learning model."
 ...

OK, my bad. I set autoencoder to TRUE:

h2o.deeplearning(y = response, training_frame = training.frame, validation_frame = test.frame, autoencoder = TRUE)

and got a new error:

Error in .verify_dataxy(training_frame, x, y, autoencoder): `y` should not be specified for autoencoder=TRUE, remove `y` input
Traceback:

1. h2o.deeplearning(y = response, training_frame = training.frame, 
 .     validation_frame = test.frame, autoencoder = TRUE)
2. .verify_dataxy(training_frame, x, y, autoencoder)
3. stop("`y` should not be specified for autoencoder=TRUE, remove `y` input")

OK, so I should remove y:

h2o.deeplearning(training_frame = training.frame, validation_frame = test.frame, autoencoder = TRUE)

But:

Error in is.numeric(y): argument "y" is missing, with no default
Traceback:

1. h2o.deeplearning(training_frame = training.frame, validation_frame = test.frame, 
 .     autoencoder = TRUE)
2. is.numeric(y)

Hmm, the last two requirements look mutually exclusive. Fine, I'll try another model:

anomaly.detection.model <- h2o.glrm(training_frame = training.frame, k = 10, seed = common.seed)

h2o.anomaly(anomaly.detection.model, training.frame, per_feature = FALSE)

and got another kind of error:

java.lang.AssertionError
 [1] "java.lang.AssertionError"                                                                                    
 [2] "    water.api.ModelMetricsHandler.predict(ModelMetricsHandler.java:439)"
 ...

The failing assertion is assert s.reconstruct_train;. I haven't dug into it yet. Maybe I'll have better luck with GBM or RF?

model = h2o.gbm(y = response,
                training_frame = training.frame,
                validation_frame = validation.frame,
                max_hit_ratio_k = 10,
                seed = common.seed,
                stopping_rounds = 3,
                stopping_tolerance = 1e-2)

h2o.anomaly(model, training.frame, per_feature = FALSE)

water.exceptions.H2OIllegalArgumentException
 [1] "water.exceptions.H2OIllegalArgumentException: Requires a Deep Learning, GLRM, DRF or GBM model."

Same with RF.

So I have two questions:

  1. How do I detect anomalies?
  2. Are these bugs, or am I doing something wrong?

Setting autoencoder = TRUE turns this into an unsupervised (clustering-style) problem, so there is no need to set the response (y).

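Also, when autoencoder is set to TRUE you still need to set x. The autoencoder error you saw above is genuine: you did not set the predictors (x). Once you set x, the problem goes away.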
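Putting the two points above together, a corrected call might look like the sketch below. It is a minimal sketch, not the asker's actual setup: the built-in iris data stands in for training.frame, "Species" stands in for response, and the hidden and epochs values are arbitrary. Leaving the response column out of x is one reasonable choice here, not a requirement.

```r
library(h2o)
h2o.init()

# Stand-ins for the question's objects (assumptions for illustration only).
training.frame <- as.h2o(iris)
response <- "Species"

# Pass x explicitly and do not pass y when autoencoder = TRUE.
predictors <- setdiff(colnames(training.frame), response)

model <- h2o.deeplearning(
  x = predictors,
  training_frame = training.frame,
  autoencoder = TRUE,
  hidden = c(2),   # arbitrary small bottleneck
  epochs = 10,     # arbitrary
  seed = 1
)

# One reconstruction error (MSE) per row.
errors <- h2o.anomaly(model, training.frame, per_feature = FALSE)
head(errors)
```

Rows with the largest reconstruction error are the ones the autoencoder reproduces worst, which is what makes them anomaly candidates.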

Here is a quick anomaly detection test I ran with H2O 3.14.0.2 on R (see this blog for more):

  > library(h2o)
  > h2o.init()
  Reading in config file: ./.h2oconfig

  H2O is not running yet, starting it now...

  Note:  In case of errors look at the following log files:
      /var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T//Rtmp7RuYKp/h2o_avkashchauhan_started_from_r.out
      /var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T//Rtmp7RuYKp/h2o_avkashchauhan_started_from_r.err

  java version "1.8.0_101"
  Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
  Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

  Starting H2O JVM and connecting: .. Connection successful!

  R is connected to the H2O cluster: 
      H2O cluster uptime:         1 seconds 948 milliseconds 
      H2O cluster version:        3.14.0.2 
      H2O cluster version age:    24 days  
      H2O cluster name:           H2O_started_from_R_avkashchauhan_alj381 
      H2O cluster total nodes:    1 
      H2O cluster total memory:   3.56 GB 
      H2O cluster total cores:    8 
      H2O cluster allowed cores:  8 
      H2O cluster healthy:        TRUE 
      H2O Connection ip:          localhost 
      H2O Connection port:        54321 
      H2O Connection proxy:       NA 
      H2O Internal Security:      FALSE 
      H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
      R Version:                  R version 3.4.0 (2017-04-21) 

  > mtcar = h2o.importFile('https://raw.githubusercontent.com/woobe/H2O_London_Workshop/master/data/auto_design.csv')
    |==================================================================================================================================| 100%
  > mtcar$gear = as.factor(mtcar$gear)
  > mtcar$carb = as.factor(mtcar$carb)
  > mtcar$cyl = as.factor(mtcar$cyl)
  > mtcar$vs = as.factor(mtcar$vs)
  > mtcar$am = as.factor(mtcar$am)
  > mtcar.dl = h2o.deeplearning(x = 2:12, training_frame = mtcar, autoencoder = TRUE, hidden = c(1,1,1), epochs = 100,seed=1)
    |==================================================================================================================================| 100%
  > errors <- h2o.anomaly(mtcar.dl, mtcar, per_feature = TRUE)
  > print(errors)
    reconstr_carb.1.SE reconstr_carb.2.SE reconstr_carb.3.SE reconstr_carb.4.SE reconstr_carb.6.SE reconstr_carb.8.SE
  1                  0                  0                  0                  1                  0                  0
  2                  0                  0                  0                  1                  0                  0
  3                  1                  0                  0                  0                  0                  0
  4                  1                  0                  0                  0                  0                  0
  5                  0                  1                  0                  0                  0                  0
  6                  1                  0                  0                  0                  0                  0
    reconstr_carb.missing(NA).SE reconstr_cyl.4.SE reconstr_cyl.6.SE reconstr_cyl.8.SE reconstr_cyl.10.SE reconstr_cyl.missing(NA).SE
  1                            0                 0                 1                 0                  0                           0
  2                            0                 0                 1                 0                  0                           0
  3                            0                 1                 0                 0                  0                           0
  4                            0                 0                 1                 0                  0                           0
  5                            0                 0                 0                 1                  0                           0
  6                            0                 0                 1                 0                  0                           0
    reconstr_gear.3.SE reconstr_gear.4.SE reconstr_gear.5.SE reconstr_gear.missing(NA).SE reconstr_vs.0.SE reconstr_vs.1.SE
  1                  0                  1                  0                            0                1                0
  2                  0                  1                  0                            0                1                0
  3                  0                  1                  0                            0                0                1
  4                  1                  0                  0                            0                0                1
  5                  1                  0                  0                            0                1                0
  6                  1                  0                  0                            0                0                1
    reconstr_vs.missing(NA).SE reconstr_am.0.SE reconstr_am.1.SE reconstr_am.missing(NA).SE reconstr_mpg.SE reconstr_disp.SE reconstr_hp.SE
  1                          0                0                1                          0    8.705556e-05     0.0196626269   0.0035177471
  2                          0                0                1                          0    8.705556e-05     0.0196626269   0.0035177471
  3                          0                0                1                          0    2.684331e-04     0.0411916382   0.0045768080
  4                          0                1                0                          0    1.307597e-05     0.0004837585   0.0035177471
  5                          0                1                0                          0    1.779785e-03     0.0102131519   0.0007516691
  6                          0                1                0                          0    2.576469e-03     0.0038200199   0.0038147898
    reconstr_drat.SE reconstr_wt.SE reconstr_qsec.SE
  1      0.002147682    0.002080628      0.003914459
  2      0.002147682    0.002054817      0.003843678
  3      0.002153499    0.002111200      0.003646228
  4      0.002244072    0.002020654      0.003545225
  5      0.002235761    0.001998203      0.003843678
  6      0.002282261    0.001996213      0.003451600

  [32 rows x 28 columns]
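The output above is the per-feature squared error. To turn reconstruction errors into actual anomaly flags, one common approach is to take the per-row MSE (per_feature = FALSE) and flag the rows above some cutoff. The helper below is a sketch of that idea in base R; the name flag_anomalies and the default quantile of 0.95 are hypothetical choices, not part of H2O.

```r
# Return the indices of rows whose reconstruction MSE is above the q-th quantile.
# `flag_anomalies` and the default q = 0.95 are hypothetical choices for illustration.
flag_anomalies <- function(mse, q = 0.95) {
  stopifnot(is.numeric(mse), length(mse) > 0, q > 0, q < 1)
  threshold <- quantile(mse, probs = q, names = FALSE)
  which(mse > threshold)
}

# With the model from the transcript, the per-row MSE could be pulled into R like this:
# mse <- as.data.frame(h2o.anomaly(mtcar.dl, mtcar, per_feature = FALSE))[[1]]
# flag_anomalies(mse)
```

A quantile cutoff always flags a fixed share of rows, so it is only a starting point; a cutoff derived from known-normal data is usually more meaningful.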

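You can also run GLRM on the same dataset, as shown below. You must set k, and you do not need to pass x, but the dataset cannot contain a constant column. That is why I pass GLRM a filtered dataset with the same columns (2:12) I used for deep learning.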

> mtcar_glrm = mtcar[2:12]
> mtcar.glrm = h2o.glrm(training_frame = mtcar_glrm,seed=1, k = 5)
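The snippet above stops at fitting the model. One way to get a per-row error out of a GLRM, sketched here as an alternative to h2o.anomaly rather than as the answer's own method, is to reconstruct the data with h2o.reconstruct and compute the squared error by hand. The sketch uses R's built-in, all-numeric mtcars instead of the CSV above, and k = 3 is arbitrary. It also assumes the reconstructed columns come back named "reconstr_<original name>"; the stopifnot call fails loudly if that assumption does not hold.

```r
library(h2o)
h2o.init()

# Built-in mtcars (all numeric) stands in for the answer's dataset.
mtcars_hf <- as.h2o(mtcars)

# k = 3 and seed = 1 are arbitrary choices.
glrm_model <- h2o.glrm(training_frame = mtcars_hf, k = 3, seed = 1)

# Reconstruct the training data from the low-rank factors.
recon <- h2o.reconstruct(glrm_model, mtcars_hf, reverse_transform = TRUE)
recon_df <- as.data.frame(recon)

# Assumption: reconstructed columns are named "reconstr_<original name>".
names(recon_df) <- sub("^reconstr_", "", names(recon_df))
stopifnot(all(names(recon_df) %in% names(mtcars)))

# Per-row mean squared reconstruction error, computed by hand.
# Columns are not standardized here, so wide-ranging ones such as disp and hp dominate.
diffs <- as.matrix(mtcars[, names(recon_df)]) - as.matrix(recon_df)
row_mse <- rowMeans(diffs^2)
head(sort(row_mse, decreasing = TRUE))
```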

I tried detecting anomalies in time-series data myself. To learn the concept I used this blog. The explanation in that blog worked well for me.

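I'd like to offer a visual representation of what happens when we detect anomalies. In the example, a deep learning model is fit to this ECG dataset. The data looks like this: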

Data we fit our Deep Learning Model

After that we feed it the test dataset shown below, which contains anomalies:

Data we test our Deep Learning Model on

Anomaly detection itself becomes possible when the 'Artificial Intelligence' spots the differences using the MSE (mean squared error) metric:

This is what the AI 'sees' on the test dataset

The resulting MSE can be obtained by following the example:

MSE output
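Since the original images are not reproduced here, the following base R sketch shows one way such an MSE plot could be drawn. The MSE values are made up for illustration (mostly small, with two spikes), and the cutoff of 0.1 is a hypothetical value chosen by eye; with a real model the vector would come from h2o.anomaly instead.

```r
# Made-up per-row MSE values for illustration: mostly small, with two spikes.
set.seed(1)
mse <- c(runif(20, 0, 0.01), 0.20, 0.25, runif(3, 0, 0.01))

# Hypothetical cutoff chosen by eye.
threshold <- 0.1
is_anomaly <- mse > threshold

# One vertical bar per row, anomalies in red, cutoff as a dashed line.
plot(mse, type = "h", col = ifelse(is_anomaly, "red", "grey40"),
     xlab = "Row index", ylab = "Reconstruction MSE")
abline(h = threshold, lty = 2)

# Indices of the rows above the cutoff.
which(is_anomaly)
```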