How can I handle mismatch between schema and model after transformation?

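I'm exploring ML.NET and want to predict employee turnover. I have a dataset available that mixes numeric and string values.

This is purely an exploration to get to know ML.NET. My approach is to work through the options one step at a time, so that I understand each step as well as possible.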

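  1. Load the data
  2. Prepare the dataset and apply categorical transforms to the string features
  3. Display the dataset after applying the transforms
  4. Split the dataset into training and test datasets
  5. Train a model using a classification algorithm
  6. Evaluate it against the test dataset
  7. Output the model's feature weights
  8. Do something cool with it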

The model is shown below. It is based on IBM's open-source attrition dataset: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Model:

public class Employee
    {
        [LoadColumn(0)]
        public int Age { get; set; }
        [LoadColumn(1)]
        //[ColumnName("Label")]
        public string Attrition { get; set; }
        [LoadColumn(2)]
        public string BusinessTravel { get; set; }
        [LoadColumn(3)]
        public int DailyRate { get; set; }
        [LoadColumn(4)]
        public string Department { get; set; }
        [LoadColumn(5)]
        public int DistanceFromHome { get; set; }
        [LoadColumn(6)]
        public int Education { get; set; }
        [LoadColumn(7)]
        public string EducationField { get; set; }
        [LoadColumn(8)]
        public int EmployeeCount { get; set; }
        [LoadColumn(9)]
        public int EmployeeNumber { get; set; }
        [LoadColumn(10)]
        public int EnvironmentSatisfaction { get; set; }
        [LoadColumn(11)]
        public string Gender { get; set; }
        [LoadColumn(12)]
        public int HourlyRate { get; set; }
        [LoadColumn(13)]
        public int JobInvolvement { get; set; }
        [LoadColumn(14)]
        public int JobLevel { get; set; }
        [LoadColumn(15)]
        public string JobRole { get; set; }
        [LoadColumn(16)]
        public int JobSatisfaction { get; set; }
        [LoadColumn(17)]
        public string MaritalStatus { get; set; }
        [LoadColumn(18)]
        public int MonthlyIncome { get; set; }
        [LoadColumn(19)]
        public int MonthlyRate { get; set; }
        [LoadColumn(20)]
        public int NumCompaniesWorked { get; set; }
        [LoadColumn(21)]
        public string Over18 { get; set; }
        [LoadColumn(22)]
        public string OverTime { get; set; }
        [LoadColumn(23)]
        public int PercentSalaryHike { get; set; }
        [LoadColumn(24)]
        public int PerformanceRating{ get; set; }
        [LoadColumn(25)]
        public int RelationshipSatisfaction{ get; set; }
        [LoadColumn(26)]
        public int StandardHours{ get; set; }
        [LoadColumn(27)]
        public int StockOptionLevel{ get; set; }
        [LoadColumn(28)]
        public int TotalWorkingYears{ get; set; }
        [LoadColumn(29)]
        public int TrainingTimesLastYear{ get; set; }
        [LoadColumn(30)]
        public int WorkLifeBalance{ get; set; }
        [LoadColumn(31)]
        public int YearsAtCompany{ get; set; }
        [LoadColumn(32)]
        public int YearsInCurrentRole{ get; set; }
        [LoadColumn(33)]
        public int YearsSinceLastPromotion{ get; set; }
        [LoadColumn(34)]
        public int YearsWithCurrManager { get; set; }
    }

Then I transform the string properties, as explained here: https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/prepare-data-ml-net#work-with-categorical-data

var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("Attrition")
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("BusinessTravel"))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("EducationField"))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("Gender"))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("JobRole"))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("MaritalStatus"))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("Over18"))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("OverTime"));
            ITransformer categoricalTransformer = categoricalEstimator.Fit(dataView);
            IDataView transformedData = categoricalTransformer.Transform(dataView);

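Now I want to inspect what has changed (https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/inspect-intermediate-data-ml-net#convert-idataview-to-ienumerable). The challenge I'm facing is that after applying the transforms to the string properties, the schema has changed and now contains vectors, as expected.

So the following happens: the schema of the Employee model no longer matches the schema of the transformedData object, and ML.NET tries to fit a Vector column into a String property and throws the following error: "Can't bind the IDataView column 'Attrition' of type 'Vector' to field or property 'Attrition' of type 'System.String'."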

  IEnumerable<Employee> employeeDataEnumerable =
                    mlContext.Data.CreateEnumerable<Employee>(transformedData, reuseRowObject: true);

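CreateEnumerable also has a SchemaDefinition parameter, so my first guess was to extract the schema from transformedData and supply it to CreateEnumerable. However, that parameter requires a Microsoft.ML.Data.SchemaDefinition, while the schema exposed by the transformed data is a Microsoft.ML.DataViewSchema. So that didn't work either.

I hope someone can advise on this. Should I be doing something differently?

Full controller action: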

public ActionResult Turnover()
{
    MLContext mlContext = new MLContext();

    var _appPath = AppDomain.CurrentDomain.BaseDirectory;
    var _dataPath = Path.Combine(_appPath, "Datasets", "WA_Fn-UseC_-HR-Employee-Attrition.csv");

    // Load data from file
    IDataView dataView = mlContext.Data.LoadFromTextFile<Employee>(_dataPath, hasHeader: true);

    // 0. Get the column name of input features.
    string[] featureColumnNames =
        dataView.Schema
            .Select(column => column.Name)
            .Where(columnName => columnName != "Label")
            .ToArray();

    // Define categorical transform estimator
    var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("Attrition")
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("BusinessTravel"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("EducationField"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("Gender"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("JobRole"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("MaritalStatus"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("Over18"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("OverTime"));
    ITransformer categoricalTransformer = categoricalEstimator.Fit(dataView);
    IDataView transformedData = categoricalTransformer.Transform(dataView);

    // Inspect (fails because Employee (35 cols) cannot be mapped to the new schema (52 cols))
    IEnumerable<Employee> employeeDataEnumerable =
        mlContext.Data.CreateEnumerable<Employee>(transformedData, reuseRowObject: true, schemaDefinition : transformedData.Schema);

    // split the transformed dataset into training and a testing datasets
    DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(transformedData, testFraction: 0.2);
    IDataView trainData = dataSplit.TrainSet;
    IDataView testData = dataSplit.TestSet;

    return View();
}

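I ran into this recently, and as a quick workaround I just created a new class that matches the transformed data schema. For example, you can create an EmployeeTransformed class with the correct properties (i.e. vectors instead of strings) and use it like this: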

CreateEnumerable<EmployeeTransformed>
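
To make that concrete, here is a minimal sketch of what such a class might look like. It assumes a trimmed-down row with only three columns; EmployeeRow and the in-memory sample rows are hypothetical stand-ins for the full Employee model and the CSV file. One-hot encoded columns come out as vectors of float, so the matching properties are declared as float[]. Columns in the data view that have no matching property are simply skipped, so the class only needs the columns you want to look at.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical, trimmed-down input row (stand-in for the full Employee class).
public class EmployeeRow
{
    public int Age { get; set; }
    public string Attrition { get; set; }
    public string OverTime { get; set; }
}

// Mirrors the schema AFTER the transform: encoded columns are float vectors.
public class EmployeeTransformed
{
    public int Age { get; set; }
    public float[] Attrition { get; set; }
    public float[] OverTime { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // In-memory sample data (made up for illustration) instead of the CSV file.
        var rows = new[]
        {
            new EmployeeRow { Age = 41, Attrition = "Yes", OverTime = "Yes" },
            new EmployeeRow { Age = 49, Attrition = "No", OverTime = "No" },
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);

        // Same kind of transform as in the question: the encoded column replaces the string column.
        var estimator = mlContext.Transforms.Categorical.OneHotEncoding("Attrition")
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("OverTime"));
        IDataView transformedData = estimator.Fit(dataView).Transform(dataView);

        // Bind the transformed data to the class whose property types match the new schema.
        IEnumerable<EmployeeTransformed> transformedRows =
            mlContext.Data.CreateEnumerable<EmployeeTransformed>(transformedData, reuseRowObject: false);

        foreach (EmployeeTransformed row in transformedRows)
        {
            Console.WriteLine(
                $"Age={row.Age}, Attrition=[{string.Join(", ", row.Attrition)}], OverTime=[{string.Join(", ", row.OverTime)}]");
        }
    }
}
```

reuseRowObject is set to false here so that every element is its own object, which is the safer choice if you ever materialize the sequence with ToList().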

This isn't optimal if you are creating a variety of transformed schemas, but it works.

Hope that helps.
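
If maintaining a second class per transformed schema becomes a burden, one possible alternative is to write the encoded values to a new column instead of overwriting the original one. OneHotEncoding accepts separate output and input column names, and when they differ the original string column stays in the schema, so the original class can still be bound. The sketch below assumes a hypothetical two-column EmployeeRow class and a made-up output column name, "AttritionEncoded".

```csharp
using System;
using Microsoft.ML;

// Hypothetical, trimmed-down input row (stand-in for the full Employee class).
public class EmployeeRow
{
    public int Age { get; set; }
    public string Attrition { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // In-memory sample data (made up for illustration).
        IDataView dataView = mlContext.Data.LoadFromEnumerable(new[]
        {
            new EmployeeRow { Age = 41, Attrition = "Yes" },
            new EmployeeRow { Age = 49, Attrition = "No" },
        });

        // Write the encoding to a NEW column; "Attrition" itself remains a string column.
        var estimator = mlContext.Transforms.Categorical.OneHotEncoding(
            outputColumnName: "AttritionEncoded",
            inputColumnName: "Attrition");
        IDataView transformedData = estimator.Fit(dataView).Transform(dataView);

        // The original class still binds, because its columns kept their original types.
        foreach (EmployeeRow row in
                 mlContext.Data.CreateEnumerable<EmployeeRow>(transformedData, reuseRowObject: false))
        {
            Console.WriteLine($"Age={row.Age}, Attrition={row.Attrition}");
        }
    }
}
```

The trade-off is that the enumerable then shows only the original values; to look at the encoded vectors you would still need a class with a float[] property named after the new column.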

For debugging purposes, you can also call transformedData.Preview() and inspect the data and the resulting schema.
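
As a rough sketch of that debugging approach, the helper below prints the schema of any IDataView and then the first few rows returned by Preview(). The class and method names (DataViewInspector, PrintSchemaAndRows) are made up for this example; Preview, RowView and Values are the ML.NET members being used.

```csharp
using System;
using Microsoft.ML;

public static class DataViewInspector
{
    // Hypothetical helper: dumps column names/types and the first maxRows rows to the console.
    public static void PrintSchemaAndRows(IDataView data, int maxRows = 5)
    {
        // The schema lists every column with its type, including the vectors produced by encoding.
        foreach (DataViewSchema.Column column in data.Schema)
        {
            Console.WriteLine($"{column.Index}: {column.Name} ({column.Type})");
        }

        // Preview materializes a small number of rows; intended for debugging only.
        var preview = data.Preview(maxRows);
        foreach (var row in preview.RowView)
        {
            foreach (var pair in row.Values)
            {
                Console.Write($"{pair.Key}={pair.Value} | ");
            }
            Console.WriteLine();
        }
    }
}
```

You would call it as DataViewInspector.PrintSchemaAndRows(transformedData); right after the Transform call. Vector-valued cells are printed using their default string representation, so for the encoded columns the schema output is usually the more informative part.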