How do I avoid the dummy variable trap with one-hot encoding?
Here is my code for reading and transforming the CSV data:
Schema schema = new Schema.Builder()
.addColumnsString("RowNumber")
.addColumnInteger("CustomerId")
.addColumnString("Surname")
.addColumnInteger("CreditScore")
.addColumnCategorical("Geography",Arrays.asList("France","Spain","Germany"))
.addColumnCategorical("Gender",Arrays.asList("Male","Female"))
.addColumnsInteger("Age","Tenure","Balance","NumOfProducts","HasCrCard","IsActiveMember","EstimatedSalary","Exited").build();
TransformProcess transformProcess = new TransformProcess.Builder(schema)
.removeColumns("RowNumber","Surname","CustomerId")
.categoricalToInteger("Gender")
.categoricalToOneHot("Geography").build();
RecordReader reader = new CSVRecordReader(1,',');
reader.initialize(new FileSplit(new ClassPathResource("Churn_Modelling.csv").getFile()));
TransformProcessRecordReader transformProcessRecordReader = new TransformProcessRecordReader(reader,transformProcess);
System.out.println("args = " + transformProcessRecordReader.next() + "");
I tried printing just the first record:
args = [619, 1, 0, 0, 1, 42, 2, 0, 1, 1, 1, 101348.88, 1]
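For example, 619 is followed by the three values 1, 0, 0.
I want to keep only 0, 0 after 619.
Basically, I want to treat the first category as the base category, with the other categories encoded relative to it, to avoid multicollinearity (the dummy variable trap).
How can I do this? Any suggestions?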
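You can call transformProcess.getFinalSchema() to inspect the schema produced by the transforms. The one-hot columns are named after the source column and category, so you can drop the column for the base category with removeColumns: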
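TransformProcess transformProcess = new TransformProcess.Builder(schema)
.removeColumns("RowNumber","Surname","CustomerId")
.categoricalToInteger("Gender")
.categoricalToOneHot("Geography")
// Drop the base category so only the Spain and Germany indicator columns remain
.removeColumns("Geography[France]")
.build();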
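To make the resulting encoding concrete, here is a minimal plain-Java sketch of what dropping the base category produces. It does not use DataVec; the class name DropFirstOneHot and the method encode are hypothetical names chosen for illustration. The first category in the list is treated as the base and maps to all zeros.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper (not part of DataVec): one-hot encoding with the first category dropped.
class DropFirstOneHot {

    // Returns k-1 indicator values for a column with k categories.
    // The first category is the base and is encoded as all zeros.
    static int[] encode(List<String> categories, String value) {
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        int[] encoded = new int[categories.size() - 1];
        if (index > 0) {
            encoded[index - 1] = 1; // shift by one because the base column is removed
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> geography = Arrays.asList("France", "Spain", "Germany");
        for (String g : geography) {
            System.out.println(g + " -> " + Arrays.toString(encode(geography, g)));
        }
    }
}
```

With this scheme France becomes [0, 0], Spain becomes [1, 0] and Germany becomes [0, 1], which matches the two values you want to see after 619 for the first record.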