ML.Net LearningPipeline always has 10 rows

I've noticed that the Microsoft.Ml.Legacy.LearningPipeline.Row count in the SentimentAnalysis sample project is always 10, no matter how much data is in the test or training set.

https://github.com/dotnet/samples/blob/master/machine-learning/tutorials/SentimentAnalysis.sln

Can anyone explain the significance of 10 here?

        // LearningPipeline allows you to add steps in order to keep everything together
        // during the learning process.
        // <Snippet5>
        var pipeline = new LearningPipeline();
        // </Snippet5>

        // The TextLoader loads a dataset with comments and corresponding positive or negative sentiment.
        // When you create a loader, you specify the schema by passing a class to the loader containing
        // all the column names and their types. This is used to create the model, and train it. 
        // <Snippet6>
        pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>());
        // </Snippet6>

        // TextFeaturizer is a transform that is used to featurize an input column. 
        // This is used to format and clean the data.
        // <Snippet7>
        pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
        // </Snippet7>

        // Adds a FastTreeBinaryClassifier, the decision tree learner for this project, and 
        // three hyperparameters to be used for tuning decision tree performance.
        // <Snippet8>
        pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 50, NumTrees = 50, MinDocumentsInLeafs = 20 });
        // </Snippet8>

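The debugger only shows a preview of the data: the first 10 rows. The goal is to show a few example rows and how each transform operates on them, to make debugging easier.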

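Reading the entire training data and running all the transforms on it is expensive, and it only happens when you reach .Train(). Because the preview transforms operate on only a few rows, their effect may differ from what you get on the full dataset (for example, the text dictionary will likely be bigger). Still, the preview shown before the full training process runs should help with debugging and with checking that the transforms are applied to the correct columns.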
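To illustrate the idea of a cheap preview versus a full pass, here is a minimal sketch using plain LINQ rather than ML.NET. The names `PreviewSketch`, `LoadRows`, `Featurize`, and `RowsRead` are hypothetical stand-ins invented for this example; this is not how the LearningPipeline is implemented internally, only an analogy for lazy evaluation with a 10-row preview.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PreviewSketch
{
    // Counts how many rows were actually pulled from the (hypothetical) data source.
    public static int RowsRead;

    // Hypothetical stand-in for a lazy loader: a row is produced only when it is enumerated.
    public static IEnumerable<string> LoadRows(int total)
    {
        for (int i = 0; i < total; i++)
        {
            RowsRead++;
            yield return $"comment {i}";
        }
    }

    // Hypothetical stand-in for a transform such as a featurizer; it is also lazy.
    public static IEnumerable<int> Featurize(IEnumerable<string> rows)
        => rows.Select(r => r.Length);

    public static void Main()
    {
        // Building the pipeline reads nothing yet.
        var pipeline = Featurize(LoadRows(1000));

        // Preview: only the first 10 rows flow through the transform.
        var preview = pipeline.Take(10).ToList();
        Console.WriteLine($"Preview rows: {preview.Count}, rows read: {RowsRead}");

        // Full pass (analogous to training): every row is read and transformed.
        RowsRead = 0;
        var all = pipeline.ToList();
        Console.WriteLine($"All rows: {all.Count}, rows read: {RowsRead}");
    }
}
```

In this sketch the preview touches only the rows it needs, while the full pass reads the whole source, which is why a row count seen in a preview says nothing about the size of the dataset.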

If you have any ideas on how to make this clearer or more useful, it would be great if you could create an issue on GitHub!