Is there a way to use VarVector to represent raw data in ML.NET K-means clustering?

Question

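I want to use ML.NET K-means clustering on some 'raw' vectors that I generate in memory by processing another dataset. I want to be able to select the length of the vectors at run time. All vectors within a given model have the same length, but that length may vary from model to model as I try different clustering approaches.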

I am using the following code:

public class MyVector
{
   [VectorType]
   public float[] Values;
}

void Train()
{

    var vectorSize = GetVectorSizeFromUser();

    var vectors = .... process dataset to create an array of MyVectors, each with 'vectorSize' values

    var mlContext = new MLContext();

    string featuresColumnName = "Features";
    var pipeline = mlContext
        .Transforms
        .Concatenate(featuresColumnName, nameof(MyVector.Values))
        .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));

    var trainingData = mlContext.Data.LoadFromEnumerable(vectors);

    Console.WriteLine("Training...");
    var model = pipeline.Fit(trainingData);
}

The problem is that when I try to run the training, I get this exception:

Schema mismatch for feature column 'Features': expected Vector, got VarVector (Parameter 'inputSchema')

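For any given value of vectorSize (say 20), I can avoid this by using [VectorType(20)], but the key point here is that I don't want to rely on a specific compile-time value. Is there a way to use dynamically sized data for this kind of training?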

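I can imagine various nasty workarounds that involve dynamically building a data view with dummy columns, but I'm hoping there is a simpler way.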

Answer 1

Thanks to Jon for finding the link to this issue, which contains the required information. The trick is to override the SchemaDefinition at run time:

public class MyVector
{
   // No [VectorType] attribute is needed here, because the custom schema definition below overrides the column type.
   public float[] Values;
}

void Train()
{

    var vectorSize = GetVectorSizeFromUser();

    var vectors = .... process dataset to create an array of MyVectors, each with 'vectorSize' values

    var mlContext = new MLContext();

    string featuresColumnName = "Features";
    var pipeline = mlContext
        .Transforms
        .Concatenate(featuresColumnName, nameof(MyVector.Values))
        .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));

    //create a custom schema-definition that overrides the type for the Values field...  
    var schemaDef = SchemaDefinition.Create(typeof(MyVector));
    schemaDef[nameof(MyVector.Values)].ColumnType 
                  = new VectorDataViewType(NumberDataViewType.Single, vectorSize);

    //use that schema definition when creating the training dataview  
    var trainingData = mlContext.Data.LoadFromEnumerable(vectors,schemaDef);

    Console.WriteLine("Training...");
    var model = pipeline.Fit(trainingData);

    //Note that the schema-definition must also be supplied when creating the prediction engine...

    var predictor = mlContext
                    .Model
                    .CreatePredictionEngine<MyVector,ClusterPrediction>(model, 
                                          inputSchemaDefinition: schemaDef);

    //now we can use the engine to predict which cluster a vector belongs to...
    var prediction = predictor.Predict(..some MyVector...);  
}
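The snippet above leaves out the ClusterPrediction class and the data generation. Below is a minimal end-to-end sketch of the same approach, under some assumptions that are not part of the original answer: ClusterPrediction is a hypothetical output class that maps the usual K-means output columns (PredictedLabel and Score), the input vectors are synthetic, and a fixed vectorSize stands in for GetVectorSizeFromUser().

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class MyVector
{
    // No [VectorType] attribute; the size is supplied through the schema definition.
    public float[] Values;
}

// Hypothetical output class (not shown in the answer above).
// It maps the columns a K-means model normally produces.
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances;
}

public static class Program
{
    public static void Main()
    {
        // Assumption: a fixed value stands in for GetVectorSizeFromUser().
        int vectorSize = 5;

        // Synthetic data: 60 vectors spread around three well-separated offsets.
        var random = new Random(0);
        var vectors = Enumerable.Range(0, 60)
            .Select(i => new MyVector
            {
                Values = Enumerable.Range(0, vectorSize)
                    .Select(_ => (float)(random.NextDouble() + (i % 3) * 10))
                    .ToArray()
            })
            .ToArray();

        var mlContext = new MLContext(seed: 0);

        // Override the column type at run time so the column becomes a
        // fixed-size Vector instead of a VarVector.
        var schemaDef = SchemaDefinition.Create(typeof(MyVector));
        schemaDef[nameof(MyVector.Values)].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, vectorSize);

        var trainingData = mlContext.Data.LoadFromEnumerable(vectors, schemaDef);

        // Inspect the resulting column type to confirm the size is now known.
        var columnType = (VectorDataViewType)trainingData.Schema[nameof(MyVector.Values)].Type;
        Console.WriteLine($"Known size: {columnType.IsKnownSize}, size: {columnType.Size}");

        string featuresColumnName = "Features";
        var pipeline = mlContext.Transforms
            .Concatenate(featuresColumnName, nameof(MyVector.Values))
            .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));

        var model = pipeline.Fit(trainingData);

        // The same schema definition is passed to the prediction engine.
        var predictor = mlContext.Model.CreatePredictionEngine<MyVector, ClusterPrediction>(
            model, inputSchemaDefinition: schemaDef);

        var prediction = predictor.Predict(vectors[0]);
        Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
    }
}
```

Printing the column type before fitting is just a quick way to check that the override took effect. Without the schema definition, the same column would be loaded as a VarVector and the trainer would reject it with the schema mismatch error from the question.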

c#

ml.net