What does ML.NET Concatenate really do?

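I believe I understand when Concatenate needs to be called, on what data, and why. What I want to understand is what physically happens to the input column data when Concatenate is called.

Is it some kind of hash function that hashes all the input data from the columns and produces a result?

In other words, I would like to know whether it is technically possible to recover the original values from the value that Concatenate generates.

Does the order of the data columns passed to Concatenate affect the resulting model, and if so, in what way?

Why am I asking all this? I am trying to understand which input parameters affect the quality of the resulting model, and how. I have many columns of input data. They are all important, and the relationships between their values are important too. If Concatenate does something simple and loses the relationships between values, I would try one approach to improving the quality of the model. If it is fairly sophisticated and preserves the details of the values, I would use other approaches.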

In ML.NET, Concatenate takes individual features (of the same type) and creates a feature vector.

In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image, while when representing texts the features might be the frequencies of occurrence of textual terms. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression.

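As far as I understand, no hashing is involved. Conceptually, you can think of it like the String.Join method, where you take individual elements and join them into one. In this case, that single component is a feature vector that, as a whole, represents the underlying data as an array of type T, where T is the data type of the individual columns.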

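As such, you can always access the individual components, and the order should not matter beyond determining which slot of the vector each value ends up in.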
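To make the String.Join analogy concrete, here is a minimal sketch in plain F# with no ML.NET involved. The `concatenate` helper and the row values are hypothetical and exist only for illustration; this is not how ML.NET implements the transform internally, only a picture of what happens to one row conceptually.

```fsharp
// Hypothetical stand-in for what Concatenate does to one row:
// copy each input value, in the order given, into a single vector.
let concatenate (columns: float32 list) : float32[] =
    Array.ofList columns

// One row of raw column values (illustrative numbers).
let numRooms, numBaths, sqFt = 3f, 2f, 1200f

// Build the feature vector; nothing is hashed or aggregated.
let features = concatenate [ numRooms; numBaths; sqFt ]

// Each original value is still there, at the slot given by the input order.
printfn "%A" features
printfn "rooms = %g, baths = %g, sqft = %g" features.[0] features.[1] features.[2]

// Passing the columns in a different order only moves values to different slots.
let reordered = concatenate [ sqFt; numRooms; numBaths ]
printfn "%A" reordered
```

Because the values are copied rather than combined, recovering an original value is just a matter of indexing into the vector at the right slot.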

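Here is an example in F# that takes data, creates a feature vector using the concatenate transform, and accesses the individual components: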

#r "nuget:Microsoft.ML"

open Microsoft.ML
open Microsoft.ML.Data

// Raw data
let housingData = 
    seq {
        {| NumRooms = 3f; NumBaths = 2f ; SqFt = 1200f|}
        {| NumRooms = 2f; NumBaths = 1f ; SqFt = 800f|}
        {| NumRooms = 6f; NumBaths = 7f ; SqFt = 5000f|}
    }

// Initialize MLContext
let ctx = new MLContext()

// Load data into IDataView
let dataView = ctx.Data.LoadFromEnumerable(housingData)

// Get individual column names in schema order.
// Note: F# sorts anonymous record fields by name, so the order here is
// NumBaths, NumRooms, SqFt (as the output confirms), not the declaration order.
let columnNames = 
    dataView.Schema 
    |> Seq.map(fun col -> col.Name)
    |> Array.ofSeq

// Create pipeline with concatenate transform
let pipeline = ctx.Transforms.Concatenate("Features", columnNames)

// Fit data to pipeline and apply transform
let transformedData = pipeline.Fit(dataView).Transform(dataView)

// Get "Features" column containing the result of applying Concatenate transform
let features = transformedData.GetColumn<float32 array>("Features")

// Deconstruct feature vector and print out individual features.
// The slots follow the order of columnNames: baths, rooms, sqft.
printfn "Baths | Rooms | Sqft"
for [|baths;rooms;sqft|] in features do
    printfn $"{baths} | {rooms} | {sqft}"

The result output to the console is:

Baths | Rooms | Sqft
2 | 3 | 1200
1 | 2 | 800
7 | 6 | 5000

If you want to understand the effect individual features have on your model, I'd suggest looking at Permutation Feature Importance (PFI) and Feature Contribution Calculation.