How can I handle a mismatch between schema and model after transformation?
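I'm exploring ML.NET and want to predict employee turnover. I have a dataset available that mixes numeric and string values.
This is purely an exploration to get to know ML.NET, so my approach is to work through the options step by step and understand each step as well as I can.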
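- Load the data
- Prepare the dataset and apply categorical transforms to the string features
- Display the dataset after the transforms have been applied
- Split the dataset into training and test datasets
- Train a model with a classification algorithm
- Evaluate it against the test dataset
- Output the model's feature weights
- Do something cool with it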
The model is shown below. It is based on IBM's open attrition dataset: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
Model:
public class Employee
{
    [LoadColumn(0)]
    public int Age { get; set; }
    [LoadColumn(1)]
    //[ColumnName("Label")]
    public string Attrition { get; set; }
    [LoadColumn(2)]
    public string BusinessTravel { get; set; }
    [LoadColumn(3)]
    public int DailyRate { get; set; }
    [LoadColumn(4)]
    public string Department { get; set; }
    [LoadColumn(5)]
    public int DistanceFromHome { get; set; }
    [LoadColumn(6)]
    public int Education { get; set; }
    [LoadColumn(7)]
    public string EducationField { get; set; }
    [LoadColumn(8)]
    public int EmployeeCount { get; set; }
    [LoadColumn(9)]
    public int EmployeeNumber { get; set; }
    [LoadColumn(10)]
    public int EnvironmentSatisfaction { get; set; }
    [LoadColumn(11)]
    public string Gender { get; set; }
    [LoadColumn(12)]
    public int HourlyRate { get; set; }
    [LoadColumn(13)]
    public int JobInvolvement { get; set; }
    [LoadColumn(14)]
    public int JobLevel { get; set; }
    [LoadColumn(15)]
    public string JobRole { get; set; }
    [LoadColumn(16)]
    public int JobSatisfaction { get; set; }
    [LoadColumn(17)]
    public string MaritalStatus { get; set; }
    [LoadColumn(18)]
    public int MonthlyIncome { get; set; }
    [LoadColumn(19)]
    public int MonthlyRate { get; set; }
    [LoadColumn(20)]
    public int NumCompaniesWorked { get; set; }
    [LoadColumn(21)]
    public string Over18 { get; set; }
    [LoadColumn(22)]
    public string OverTime { get; set; }
    [LoadColumn(23)]
    public int PercentSalaryHike { get; set; }
    [LoadColumn(24)]
    public int PerformanceRating { get; set; }
    [LoadColumn(25)]
    public int RelationshipSatisfaction { get; set; }
    [LoadColumn(26)]
    public int StandardHours { get; set; }
    [LoadColumn(27)]
    public int StockOptionLevel { get; set; }
    [LoadColumn(28)]
    public int TotalWorkingYears { get; set; }
    [LoadColumn(29)]
    public int TrainingTimesLastYear { get; set; }
    [LoadColumn(30)]
    public int WorkLifeBalance { get; set; }
    [LoadColumn(31)]
    public int YearsAtCompany { get; set; }
    [LoadColumn(32)]
    public int YearsInCurrentRole { get; set; }
    [LoadColumn(33)]
    public int YearsSinceLastPromotion { get; set; }
    [LoadColumn(34)]
    public int YearsWithCurrManager { get; set; }
}
Then I transform the string properties (as explained here: https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/prepare-data-ml-net#work-with-categorical-data):
var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("Attrition")
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("BusinessTravel"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("EducationField"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("Gender"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("JobRole"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("MaritalStatus"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("Over18"))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("OverTime"));
ITransformer categoricalTransformer = categoricalEstimator.Fit(dataView);
IDataView transformedData = categoricalTransformer.Transform(dataView);
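Now I want to inspect what has changed (https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/inspect-intermediate-data-ml-net#convert-idataview-to-ienumerable). The challenge I'm facing is that after the transforms are applied to the string properties, the schema has changed and now contains vectors, as expected.
So the following happens: the schema of the Employee model no longer matches the schema of the transformedData object. ML.NET tries to bind a Vector column to a String property and throws this error: "Can't bind the IDataView column 'Attrition' of type 'Vector' to field or property 'Attrition' of type 'System.String'."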
IEnumerable<Employee> employeeDataEnumerable =
    mlContext.Data.CreateEnumerable<Employee>(transformedData, reuseRowObject: true);
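CreateEnumerable also has a schemaDefinition parameter, so my first guess was to take the schema from transformedData and pass it to CreateEnumerable. However, that parameter expects a Microsoft.ML.Data.SchemaDefinition, whereas the schema exposed by the transformed data is a Microsoft.ML.DataViewSchema. So that didn't work either.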
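To make the difference between the two schema types concrete, here is a small sketch. It is not part of the original code: it assumes ML.NET 1.x (the Microsoft.ML NuGet package) and uses a hypothetical, abbreviated EmployeeRow class in place of the full Employee model. A SchemaDefinition describes how a C# class maps to columns, while a DataViewSchema describes the columns an IDataView actually contains.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical two-column stand-in for the full Employee class.
public class EmployeeRow
{
    public int Age { get; set; }
    public string Attrition { get; set; }
}

public static class SchemaTypesSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        var rows = new[]
        {
            new EmployeeRow { Age = 41, Attrition = "Yes" },
            new EmployeeRow { Age = 49, Attrition = "No" },
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);
        IDataView transformedData = mlContext.Transforms.Categorical
            .OneHotEncoding("Attrition")
            .Fit(dataView)
            .Transform(dataView);

        // SchemaDefinition: built from the C# class, independent of any data.
        SchemaDefinition fromClass = SchemaDefinition.Create(typeof(EmployeeRow));
        foreach (var column in fromClass)
            Console.WriteLine($"class: {column.ColumnName} ({column.ColumnType})");

        // DataViewSchema: the columns the transformed IDataView really has.
        DataViewSchema fromData = transformedData.Schema;
        foreach (var column in fromData.Where(c => !c.IsHidden))
            Console.WriteLine($"data:  {column.Name} ({column.Type})");
    }
}
```

In this sketch the class-side schema still reports Attrition as text, while the data-side schema reports it as a vector, which is the mismatch behind the binding error.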
I hope someone can advise me on this. Should I be doing something differently?
Full controller action:
public ActionResult Turnover()
{
    MLContext mlContext = new MLContext();
    var _appPath = AppDomain.CurrentDomain.BaseDirectory;
    var _dataPath = Path.Combine(_appPath, "Datasets", "WA_Fn-UseC_-HR-Employee-Attrition.csv");

    // Load data from file
    IDataView dataView = mlContext.Data.LoadFromTextFile<Employee>(_dataPath, hasHeader: true);

    // 0. Get the column names of the input features.
    string[] featureColumnNames =
        dataView.Schema
            .Select(column => column.Name)
            .Where(columnName => columnName != "Label")
            .ToArray();

    // Define categorical transform estimator
    var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("Attrition")
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("BusinessTravel"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("EducationField"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("Gender"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("JobRole"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("MaritalStatus"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("Over18"))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding("OverTime"));
    ITransformer categoricalTransformer = categoricalEstimator.Fit(dataView);
    IDataView transformedData = categoricalTransformer.Transform(dataView);

    // Inspect (fails because Employee (35 cols) cannot be mapped to the new schema (52 cols)).
    // This is the attempt described above: schemaDefinition expects a SchemaDefinition,
    // not the DataViewSchema passed here.
    IEnumerable<Employee> employeeDataEnumerable =
        mlContext.Data.CreateEnumerable<Employee>(transformedData, reuseRowObject: true, schemaDefinition: transformedData.Schema);

    // Split the transformed dataset into training and testing datasets
    DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(transformedData, testFraction: 0.2);
    IDataView trainData = dataSplit.TrainSet;
    IDataView testData = dataSplit.TestSet;

    return View();
}
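I ran into this recently, and as a quick workaround I simply created a new class that matches the schema of the transformed data. For example, you could create an EmployeeTransformed class with the correct property types (i.e. vectors instead of strings) and use it like this: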
CreateEnumerable<EmployeeTransformed>
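A minimal sketch of what such a class could look like, assuming ML.NET 1.x (the Microsoft.ML NuGet package). Both EmployeeRow and EmployeeTransformed are hypothetical and cut down to three columns for illustration; a real version would need a property for every column you want to read back. One-hot encoded columns are vectors of floats, so the sketch declares them as float[].

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Abbreviated input row (hypothetical; the real Employee class has 35 columns).
public class EmployeeRow
{
    public int Age { get; set; }
    public string Attrition { get; set; }
    public string OverTime { get; set; }
}

// Hypothetical class matching the schema after one-hot encoding:
// the encoded columns are float vectors instead of strings.
public class EmployeeTransformed
{
    public int Age { get; set; }
    public float[] Attrition { get; set; }
    public float[] OverTime { get; set; }
}

public static class TransformedClassSketch
{
    public static List<EmployeeTransformed> Encode(IEnumerable<EmployeeRow> rows)
    {
        var mlContext = new MLContext(seed: 0);
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);

        var estimator = mlContext.Transforms.Categorical.OneHotEncoding("Attrition")
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("OverTime"));
        IDataView transformedData = estimator.Fit(dataView).Transform(dataView);

        // reuseRowObject: false gives a distinct object per row,
        // which is what we need when keeping the rows in a list.
        return mlContext.Data
            .CreateEnumerable<EmployeeTransformed>(transformedData, reuseRowObject: false)
            .ToList();
    }

    public static void Main()
    {
        var rows = new[]
        {
            new EmployeeRow { Age = 41, Attrition = "Yes", OverTime = "Yes" },
            new EmployeeRow { Age = 49, Attrition = "No", OverTime = "No" },
        };
        foreach (var e in Encode(rows))
            Console.WriteLine($"{e.Age}: [{string.Join(", ", e.Attrition)}]");
    }
}
```

With two distinct Attrition values in the input, each encoded Attrition vector has two slots.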
This isn't ideal if you end up creating many different transformed schemas, but it works.
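One possible way to avoid maintaining a second class, sketched here under the assumption that you only need to read the original values back: pass a separate outputColumnName to OneHotEncoding so that each encoded vector goes into a new column. The original string columns then stay in the schema, and CreateEnumerable can still bind the unchanged class to them, because columns in the IDataView without a matching property are ignored. The "...Encoded" column names and the abbreviated EmployeeRow class are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Abbreviated, hypothetical stand-in for the full Employee class.
public class EmployeeRow
{
    public int Age { get; set; }
    public string Attrition { get; set; }
    public string OverTime { get; set; }
}

public static class SeparateOutputColumnsSketch
{
    public static (List<string> ColumnNames, List<EmployeeRow> Rows) Encode(
        IEnumerable<EmployeeRow> input)
    {
        var mlContext = new MLContext(seed: 0);
        IDataView dataView = mlContext.Data.LoadFromEnumerable(input);

        // Write each encoded vector to a new column instead of overwriting the input.
        var estimator = mlContext.Transforms.Categorical
                .OneHotEncoding(outputColumnName: "AttritionEncoded", inputColumnName: "Attrition")
            .Append(mlContext.Transforms.Categorical
                .OneHotEncoding(outputColumnName: "OverTimeEncoded", inputColumnName: "OverTime"));
        IDataView transformedData = estimator.Fit(dataView).Transform(dataView);

        // Visible column names of the transformed data (hidden intermediates skipped).
        var columnNames = transformedData.Schema
            .Where(c => !c.IsHidden)
            .Select(c => c.Name)
            .ToList();

        // The original class still binds: Attrition and OverTime are still strings.
        var rows = mlContext.Data
            .CreateEnumerable<EmployeeRow>(transformedData, reuseRowObject: false)
            .ToList();
        return (columnNames, rows);
    }

    public static void Main()
    {
        var input = new[]
        {
            new EmployeeRow { Age = 41, Attrition = "Yes", OverTime = "Yes" },
            new EmployeeRow { Age = 49, Attrition = "No", OverTime = "No" },
        };
        var (columnNames, rows) = Encode(input);
        Console.WriteLine(string.Join(", ", columnNames));
        foreach (var row in rows)
            Console.WriteLine($"{row.Age}: {row.Attrition}, {row.OverTime}");
    }
}
```

The trade-off is that the encoded vectors themselves are not visible through EmployeeRow; to look at them you would still need a class with float[] properties or a debugging helper.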
Hope that helps.
For debugging purposes, you can also call transformedData.Preview() and look at the data and the resulting schema.
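A rough sketch of how Preview() could be used, assuming ML.NET 1.x and a hypothetical two-column EmployeeRow class instead of the full model. Preview materializes a limited number of rows and is meant for debugging, so it needs no class that matches the transformed schema.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical two-column stand-in for the full Employee class.
public class EmployeeRow
{
    public int Age { get; set; }
    public string Attrition { get; set; }
}

public static class PreviewSketch
{
    public static DataDebuggerPreview BuildPreview()
    {
        var mlContext = new MLContext(seed: 0);
        var rows = new[]
        {
            new EmployeeRow { Age = 41, Attrition = "Yes" },
            new EmployeeRow { Age = 49, Attrition = "No" },
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);
        IDataView transformedData = mlContext.Transforms.Categorical
            .OneHotEncoding("Attrition")
            .Fit(dataView)
            .Transform(dataView);

        // Materialize at most 5 rows for inspection.
        return transformedData.Preview(maxRows: 5);
    }

    public static void Main()
    {
        DataDebuggerPreview preview = BuildPreview();

        // Column names and types after the transformation.
        foreach (var column in preview.Schema.Where(c => !c.IsHidden))
            Console.WriteLine($"{column.Name}: {column.Type}");

        // Each row as "column=value" pairs.
        foreach (var row in preview.RowView)
            Console.WriteLine(string.Join(" | ", row.Values.Select(kv => $"{kv.Key}={kv.Value}")));
    }
}
```

When running under a debugger, the same preview object can also be expanded in the watch window instead of being printed.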