How to specify the number of features (vector type) at run time for KMeans clustering with ML.NET

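I want to use the ML.NET KMeans algorithm, but at compile time I do not know the size of the data set, that is, the number of features.

As far as I can see, the length passed to the VectorType attribute must be a const, so passing it in as a parameter does not work.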

class Data
{ 
    public string ID{ get; set; }

    [VectorType(5)] // I do not know whether the data will contain 5 or more features
    public float[] Features { get; set; }   
}

Usage:

InputData row = new InputData { AssetID = Data[0, i + 1].ToString(), Features = features };

var context = new MLContext();
var DataView = context.Data.LoadFromEnumerable(dataArray);
string featuresColumnName = "Features";
var pipeline = context.Transforms.Concatenate(featuresColumnName, "Features")
    .Append(context.Clustering.Trainers.KMeans(featuresColumnName, clustersCount: NumberClusters));

var model = pipeline.Fit(DataView);

If the vector dimension is the same for every row, you can work around this at run time:

    private class SampleTemperatureDataVector
    {
        public DateTime Date { get; set; }
        public float[] Temperature { get; set; }
    }

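Note that this type carries no annotations. You can create a SchemaDefinition from it and then modify that schema. In the initial SchemaDefinition, the vector column type has its IsKnownSize property set to false. After the modification, Size is set to the dimension you specify, 3 in this example.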

        var data2 = new SampleTemperatureDataVector[]
        {
            new SampleTemperatureDataVector
            {
                Date = DateTime.UtcNow, 
                Temperature = new float[] {1.2f, 3.4f, 5.6f}
            },
            new SampleTemperatureDataVector
            {
                Date = DateTime.UtcNow,
                Temperature = new float[] {1.2f, 3.4f, 5.6f}
            },
        };

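        // Known only at run time; every row must hold this many values.
        int featureDimension = 3;
        // Infer a schema from the unannotated type; Temperature starts as a variable-size vector.
        var autoSchema = SchemaDefinition.Create(typeof(SampleTemperatureDataVector));
        var featureColumn = autoSchema[1]; // index 1 is the Temperature column
        var itemType = ((VectorDataViewType)featureColumn.ColumnType).ItemType;
        // Replace the column type with a fixed-size vector of the same item type.
        featureColumn.ColumnType = new VectorDataViewType(itemType, featureDimension);

        var mlContext = new MLContext();
        IDataView data3 = mlContext.Data.LoadFromEnumerable(data2, autoSchema);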
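
To tie this back to the question, here is a minimal sketch of how the same schema trick might feed the KMeans trainer. It assumes ML.NET 1.x (where the cluster-count parameter is named numberOfClusters) and reuses the InputData shape from the question without the [VectorType] attribute. The names RuntimeKMeans, GetFeatureDimension and Cluster are hypothetical helpers invented for this illustration, not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class InputData
{
    public string AssetID { get; set; }

    // No [VectorType] attribute: the size is supplied at run time via the schema.
    public float[] Features { get; set; }
}

// Hypothetical helper class for this sketch.
public static class RuntimeKMeans
{
    // Returns the feature count shared by all rows; throws if rows disagree.
    public static int GetFeatureDimension(IReadOnlyList<InputData> rows)
    {
        if (rows.Count == 0)
            throw new ArgumentException("At least one row is required.", nameof(rows));

        int dimension = rows[0].Features.Length;
        if (rows.Any(r => r.Features.Length != dimension))
            throw new ArgumentException("All rows must have the same number of features.", nameof(rows));

        return dimension;
    }

    // Trains KMeans on the rows and returns the predicted cluster id of each row.
    public static uint[] Cluster(IReadOnlyList<InputData> rows, int numberOfClusters)
    {
        int dimension = GetFeatureDimension(rows);

        // Start from the inferred schema, then fix the vector size of "Features".
        var schema = SchemaDefinition.Create(typeof(InputData));
        schema["Features"].ColumnType =
            new VectorDataViewType(NumberDataViewType.Single, dimension);

        var context = new MLContext(seed: 0);
        IDataView dataView = context.Data.LoadFromEnumerable(rows, schema);

        // "Features" is already a single vector column, so no Concatenate step is used here.
        var trainer = context.Clustering.Trainers.KMeans(
            featureColumnName: "Features",
            numberOfClusters: numberOfClusters);

        var model = trainer.Fit(dataView);
        IDataView scored = model.Transform(dataView);

        // KMeans writes the assigned cluster to the "PredictedLabel" key column.
        return scored.GetColumn<uint>("PredictedLabel").ToArray();
    }
}
```

Calling RuntimeKMeans.Cluster(rows, NumberClusters) with rows built at run time should then work for any feature count, as long as every row has the same length. Checking the lengths up front is a design choice in this sketch: it surfaces a mismatch as a clear exception before the data reaches the trainer.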