Deeplearning4j - how to iterate multiple DataSets for large data?

I am working with Deeplearning4j (version 1.0.0-M1.1) to build neural networks.

I took the IrisClassifier from the Deeplearning4j examples as a starting point, and it works fine:

//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(DownloaderUtility.IRISDATA.Download(),"iris.txt")));

//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 4;     //5 values in each row of the iris.txt CSV: 4 input features followed by an integer label (class) index. Labels are the 5th value (index 4) in each row
int numClasses = 3;     //3 classes (types of iris flowers) in the iris data set. Classes have integer values 0, 1 or 2
int batchSize = 150;    //Iris data set: 150 examples total. We are loading all of them into one DataSet (not recommended for large data sets)

DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65);  //Use 65% of data for training

DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();

//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainingData);           //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(trainingData);     //Apply normalization to the training data
normalizer.transform(testData);         //Apply normalization to the test data. This is using statistics calculated from the *training* set

final int numInputs = 4;
int outputNum = 3;
long seed = 6;

log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .activation(Activation.TANH)
    .weightInit(WeightInit.XAVIER)
    .updater(new Sgd(0.1))
    .l2(1e-4)
    .list()
    .layer(new DenseLayer.Builder().nIn(numInputs).nOut(3)
        .build())
    .layer(new DenseLayer.Builder().nIn(3).nOut(3)
        .build())
    .layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
        .nIn(3).nOut(outputNum).build())
    .build();

//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));

for(int i=0; i<1000; i++ ) {
    model.fit(trainingData);
}

//evaluate the model on the test set
Evaluation eval = new Evaluation(3);
INDArray output = model.output(testData.getFeatures());

eval.eval(testData.getLabels(), output);
log.info(eval.stats());

For my project, I have about 30000 records as input (vs. 150 in the iris example), and each record is a vector of size ~7000 (vs. 4 in the iris example).

Obviously, I can't process all of the data in a single DataSet; that would produce an OOM in the JVM.

How can I process the data in multiple DataSets?

I assumed it should look something like this (store the DataSets in a list and iterate over them):

...
    DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
    List<DataSet> trainingData = new ArrayList<>();
    List<DataSet> testData = new ArrayList<>();

    while (iterator.hasNext()) {
        DataSet allData = iterator.next();
        allData.shuffle();
        SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65);  //Use 65% of data for training
        trainingData.add(testAndTrain.getTrain());
        testData.add(testAndTrain.getTest());
    }
    //We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
    DataNormalization normalizer = new NormalizerStandardize();
    for (DataSet dataSetTraining : trainingData) {
        normalizer.fit(dataSetTraining);           //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
        normalizer.transform(dataSetTraining);     //Apply normalization to the training data
    }
    for (DataSet dataSetTest : testData) {
        normalizer.transform(dataSetTest);         //Apply normalization to the test data. This is using statistics calculated from the *training* set
    }

...

    for(int i=0; i<1000; i++ ) {
        for (DataSet dataSetTraining : trainingData) {
            model.fit(dataSetTraining);
        }
    }

But when I run the evaluation, I get this error:

Exception in thread "main" java.lang.NullPointerException: Cannot read field "javaShapeInformation" because "this.jvmShapeInfo" is null
    at org.nd4j.linalg.api.ndarray.BaseNDArray.dataType(BaseNDArray.java:5507)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.validateNumericalArray(BaseNDArray.java:5575)
    at org.nd4j.linalg.api.ndarray.BaseNDArray.add(BaseNDArray.java:3087)
    at com.aarcapital.aarmlclassifier.classification.FAClassifierLearning.main(FAClassifierLearning.java:117)

...

    Evaluation eval = new Evaluation(26);

    INDArray output = new NDArray();
    for (DataSet dataSetTest : testData) {
        output.add(model.output(dataSetTest.getFeatures())); // ERROR HERE
    }

    System.out.println("--- Output ---");
    System.out.println(output);

    INDArray labels = new NDArray();
    for (DataSet dataSetTest : testData) {
        labels.add(dataSetTest.getLabels());
    }

    System.out.println("--- Labels ---");
    System.out.println(labels);

    eval.eval(labels, output);
    log.info(eval.stats());

What is the correct way to iterate over multiple DataSets when training a network?

Thanks!

First: always use Nd4j.create(..) for ndarrays. Never use the implementation classes directly. That lets you safely create ndarrays that work whether you are running on CPUs or GPUs.
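For example, rather than new NDArray(), you can collect the per-batch outputs and labels and stack them with Nd4j.vstack. A minimal sketch, reusing the model and testData variables from the question:

// Collect per-batch outputs and labels, then stack them row-wise.
// Nd4j.vstack concatenates along dimension 0, so every test example
// keeps its own row in the final matrices.
List<INDArray> outputList = new ArrayList<>();
List<INDArray> labelList = new ArrayList<>();
for (DataSet dataSetTest : testData) {
    outputList.add(model.output(dataSetTest.getFeatures()));
    labelList.add(dataSetTest.getLabels());
}
INDArray output = Nd4j.vstack(outputList);
INDArray labels = Nd4j.vstack(labelList);

Evaluation eval = new Evaluation(26);
eval.eval(labels, output);
log.info(eval.stats());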

Second: always use RecordReaderDataSetIterator's builder rather than its constructor. The constructor is long and error-prone.

That's why we made the builder in the first place.
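A minimal sketch of the builder, reusing the recordReader, batchSize, labelIndex and numClasses variables from the question:

// Build the iterator instead of calling the multi-argument constructor.
// classification(labelIndex, numClasses) tells the iterator which column
// holds the class label and how many classes to one-hot encode it into.
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize)
        .classification(labelIndex, numClasses)
        .build();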

Your NullPointerException doesn't actually come from where you think it does. It comes from how you created the ndarray: with no data type or shape information, it doesn't know what to expect. Nd4j.create(..) will set the ndarray up properly for you.

Beyond that, the way you are doing things is on the right track. The record reader handles the minibatching for you.
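Putting those pieces together, here is a hedged sketch of an iterator-based workflow. The separate trainReader/testReader (one file per split) and the numEpochs value are assumptions, while fit(DataSetIterator, int) and evaluate(DataSetIterator) are standard MultiLayerNetwork methods:

// Assumption: the data has been split into separate train/test files,
// each read by its own record reader, so no per-batch splitTestAndTrain.
DataSetIterator trainIter = new RecordReaderDataSetIterator.Builder(trainReader, batchSize)
        .classification(labelIndex, numClasses)
        .build();
DataSetIterator testIter = new RecordReaderDataSetIterator.Builder(testReader, batchSize)
        .classification(labelIndex, numClasses)
        .build();

// Fit the normalizer by streaming over the training iterator, then attach
// it as a pre-processor so every batch is normalized on the fly.
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(trainIter);
trainIter.setPreProcessor(normalizer);
testIter.setPreProcessor(normalizer);

// Train directly on the iterator; only one minibatch is in memory at a time.
int numEpochs = 100; // assumption: tune for your data
model.fit(trainIter, numEpochs);

// Evaluate batch by batch on the test iterator.
Evaluation eval = model.evaluate(testIter);
log.info(eval.stats());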