CNTK Input Data Structure for example: CSTrainingCPUOnlyExamples

Question

I am using the CNTK example LSTMSequenceClassifier via the console application CSTrainingCPUOnlyExamples, with the default data file Train.ctf, which looks like this:

Input layer dimension: 2000 (one-hot vector); output: 5 classes (softmax).

The file is loaded like this:

MinibatchSource minibatchSource = MinibatchSource.TextFormatMinibatchSource(Path.Combine(DataFolder, "Train.ctf"), streamConfigurations, MinibatchSource.InfinitelyRepeat, true);

StreamInformation featureStreamInfo = minibatchSource.StreamInfo(featuresName);

StreamInformation labelStreamInfo = minibatchSource.StreamInfo(labelsName);

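I would really appreciate an explanation of how the data file is generated, and how the 2000 inputs map to the 5 class outputs.

My goal is to write an application that formats data and saves it to a file that can be read back as an input data file. To do that, I need to understand the structure.

Thanks!

I can see the Y dimension, and that part makes sense, but I am having trouble with the input layer.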

Answer 1

EDIT: @Frank Seide MSFT

Could you verify this and suggest a best practice:

private string Format(int sequenceId, string featureName, string featureShape, string labelName, string featureComment, string labelShape, string labelComment)
{
    return $"{sequenceId} |{featureName.Replace(" ","-")} {featureShape} |# {featureComment}   |{labelName.Replace(" ","-")} {labelShape} |# {labelComment}\r\n";
}

This might return something like:

0 |x 560:1 |# I am a comment   |y 1 0 0 0 0 |# I am a comment

where:

sequenceId = 0;
featureName = "x";
featureShape = "560:1";
featureComment = "I am a comment";
labelName = "y";
labelShape = "1 0 0 0 0";
labelComment = "I am a comment";

On the GPU, Frank did recommend using about 20 sequences per minibatch; see: https://www.youtube.com/watch?v=TK671HxrufE @26:25

This is for custom C# dataset formatting.

End of edit...

By chance, I found the answer in some documentation:
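As a minimal sketch of such custom formatting, the class below writes one sequence in the layout used by the example line above: every line of a sequence repeats the sequence id, each step carries one sparse one-hot feature (index:1) under the input name x, and the dense one-hot label y appears on the first line only. The class and method names (CtfWriterSketch, SparseOneHot, DenseOneHot, FormatSequence) are hypothetical, and placing the label only on the first line is an assumption about how the sample file is laid out, not something this code verifies.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of CNTK: formats one sequence as CTF text.
public static class CtfWriterSketch
{
    // Sparse one-hot feature: a single "index:1" pair.
    public static string SparseOneHot(int index, int dimension)
    {
        if (index < 0 || index >= dimension)
            throw new ArgumentOutOfRangeException(nameof(index));
        return index.ToString(CultureInfo.InvariantCulture) + ":1";
    }

    // Dense one-hot label: numClasses values with a single 1.
    public static string DenseOneHot(int classIndex, int numClasses)
    {
        if (classIndex < 0 || classIndex >= numClasses)
            throw new ArgumentOutOfRangeException(nameof(classIndex));
        return string.Join(" ",
            Enumerable.Range(0, numClasses).Select(i => i == classIndex ? "1" : "0"));
    }

    // One line per step; all lines share the same sequence id.
    // Assumption: the label is written on the first line of the sequence only.
    public static string FormatSequence(int sequenceId, int[] tokenIndices, int classIndex,
                                        int inputDim = 2000, int numClasses = 5)
    {
        var sb = new StringBuilder();
        string id = sequenceId.ToString(CultureInfo.InvariantCulture);
        for (int t = 0; t < tokenIndices.Length; t++)
        {
            sb.Append(id).Append(" |x ").Append(SparseOneHot(tokenIndices[t], inputDim));
            if (t == 0)
                sb.Append(" |y ").Append(DenseOneHot(classIndex, numClasses));
            sb.Append('\n');
        }
        return sb.ToString();
    }
}
```

For example, CtfWriterSketch.FormatSequence(0, new[] { 560, 0, 0 }, 0) produces three lines: "0 |x 560:1 |y 1 0 0 0 0", "0 |x 0:1" and "0 |x 0:1". The defaults of 2000 and 5 simply mirror the dimensions mentioned in the question.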

BrainScript CNTK Text Format Reader using CNTKTextFormatReader

The document goes on to explain:

CNTKTextFormatReader (later simply CTF Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:

Multiple input streams (inputs) per file
Both sparse and dense inputs
Variable length sequences

CNTK Text Format (CTF)

Each line in the input file contains one sample for one or more inputs. Since (explicitly or implicitly) every line is also attached to a sequence, it defines one or more sequence, input, sample relations. Each input line must be formatted as follows:

[Sequence_Id](Sample or Comment)+

where

Sample=|Input_Name (Value )*
Comment=|# some content

Each line starts with a sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).

Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.

Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).

Each sample begins with a pipe symbol (|) followed by the input name (no spaces), followed by a whitespace delimiter and then a list of values.

Each value is either a number or an index-prefixed number for sparse inputs.

Both tabs and spaces can be used interchangeably as delimiters.

A comment starts with a pipe immediately followed by a hash symbol: |#, then followed by the actually content (body) of the comment. The body can contain any characters, however a pipe symbol inside the body needs to be escaped by appending the hash symbol to it (see the example below). The body of a comment continues until the end of line or the next un-escaped pipe, whichever comes first.
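To make the line grammar above concrete, here is a rough sketch of a parser for a single CTF line. It is only an illustration of the quoted rules (optional sequence id, samples introduced by a pipe, comments introduced by |#, and |# inside a comment body treated as an escaped pipe), not the actual CTF Reader. The name CtfLineSketch and the choice to return values as raw strings are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical illustration of the CTF line grammar; not the real CTF Reader.
public static class CtfLineSketch
{
    // Returns the optional sequence id and the raw value tokens per input name.
    // Comments are skipped.
    public static (long? SequenceId, Dictionary<string, string[]> Inputs) Parse(string line)
    {
        line = line.TrimEnd('\r', '\n');
        int firstPipe = line.IndexOf('|');
        if (firstPipe < 0)
            throw new FormatException("A CTF line needs at least one sample or comment.");

        // Anything before the first pipe is the (optional) sequence id.
        string head = line.Substring(0, firstPipe).Trim();
        long? sequenceId = head.Length == 0
            ? (long?)null
            : long.Parse(head, CultureInfo.InvariantCulture);

        var inputs = new Dictionary<string, string[]>();
        int i = firstPipe;
        while (i < line.Length)
        {
            // Invariant: line[i] is the pipe that opens a sample or a comment.
            bool isComment = i + 1 < line.Length && line[i + 1] == '#';
            int start = i + 1;
            int j = start;
            if (isComment)
            {
                j = start + 1; // skip the '#'
                // Inside a comment, "|#" is an escaped pipe; only a pipe that is
                // NOT followed by '#' ends the comment body.
                while (j < line.Length &&
                       !(line[j] == '|' && !(j + 1 < line.Length && line[j + 1] == '#')))
                    j++;
            }
            else
            {
                while (j < line.Length && line[j] != '|') j++;
                string[] parts = line.Substring(start, j - start)
                    .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
                if (parts.Length == 0)
                    throw new FormatException("Missing input name after '|'.");
                if (inputs.ContainsKey(parts[0]))
                    throw new FormatException("Input '" + parts[0] + "' appears twice on one line.");
                inputs[parts[0]] = parts.Skip(1).ToArray();
            }
            i = j;
        }
        return (sequenceId, inputs);
    }
}
```

Applied to the line "0 |x 560:1 |# I am a comment   |y 1 0 0 0 0 |# I am a comment", this sketch yields sequence id 0, input x with the single token 560:1, and input y with five tokens; both comments are dropped.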

This comes in handy and gives an answer:

The input data corresponding to the reader configuration above should look something like this:

|B 100:3 123:4 |C 8 |A 0 1 2 3 4 |# a CTF comment
|# another comment
|A 0 1.1 22 0.3 54 |C 123917 |B 1134:1.911 13331:0.014
|C -0.001 |# a comment with an escaped pipe: '|#' |A 3.9 1.11 121.2 99.13 0.04 |B 999:0.001 918918:-9.19

Note the following about the input format:

|Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the correspondent value vector.

Dense vector is just a list of floating point values; sparse vector is a list of index:value tuples.

Both tabs and spaces are allowed as value delimiters (within input vectors) as well as input delimiters (between inputs).

Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).

Each input identifier can only appear once on a single line (which translates into one sample per input per line requirement).

The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs).

Each well-formed line must end with either a "Line Feed" \n or "Carriage Return, Line Feed" \r\n symbols.
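The difference between dense and sparse value vectors can be illustrated with a small sketch that expands either form into a plain float array. A token containing a colon is read as an index:value pair, and any other token is treated as the next dense value. The name CtfValueSketch and the explicit dimension parameter are assumptions for this illustration; how the real reader stores sparse data internally is not covered here.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration: expands CTF value tokens into a dense float array.
public static class CtfValueSketch
{
    public static float[] ToDense(string[] values, int dimension)
    {
        var dense = new float[dimension];
        for (int k = 0; k < values.Length; k++)
        {
            int colon = values[k].IndexOf(':');
            if (colon < 0)
            {
                // Dense form: the k-th token is the k-th component.
                dense[k] = float.Parse(values[k], CultureInfo.InvariantCulture);
            }
            else
            {
                // Sparse form: "index:value" sets one component; the rest stay 0.
                int index = int.Parse(values[k].Substring(0, colon), CultureInfo.InvariantCulture);
                dense[index] = float.Parse(values[k].Substring(colon + 1), CultureInfo.InvariantCulture);
            }
        }
        return dense;
    }
}
```

With this sketch, the sparse tokens 100:3 and 123:4 expand to a vector that is zero everywhere except at positions 100 and 123, and a feature such as 560:1 with dimension 2000 becomes a 2000-element one-hot vector.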

Some great material on input and label data in these videos:

https://youtu.be/hMRrqkl77rI - @30:23
https://youtu.be/Vi05nEzAS8Y - @25:20

Also helpful, though not directly related: Read and feed data to CNTK Trainer


c#

cntk