Weka filters cause data loss
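I'm building a random forest model with Weka.
My data is stored in a MySQL database. I couldn't find a way (at least not a simple one) to create a Weka dataset (an 'Instances' object) directly from the database, so I query the database and build the Weka dataset (Instances) with this code: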
List<MetadataRecord> metadata = acquireMetadata(); // Loading from DB
int datasetSize = metadata.size();
int numFeatures = MetadataRecord.FEATURE_NUM; // Currently set to 14
ArrayList<Attribute> atts = new ArrayList<Attribute>();
List<Instance> instances = new ArrayList<Instance>();
for (int feature = 0; feature < numFeatures; feature++) {
    Attribute current = new Attribute("Attribute" + feature, feature);
    if (feature == 0) {
        for (int obj = 0; obj < datasetSize; obj++) {
            instances.add(new SparseInstance(numFeatures));
        }
    }
    for (int obj = 0; obj < datasetSize; obj++) {
        MetadataRecord record = metadata.get(obj);
        Instance inst = instances.get(obj);
        switch (feature) {
            case 0:
                inst.setValue(current, record.labelId);
                break;
            case 1:
                inst.setValue(current, record.isSecured ? 2 : 1);
                break;
            case 2:
                inst.setValue(current, record.pageCount);
                break;
            // Spared cases 3-13...
        }
    }
    atts.add(current);
}
Instances newDataset = new Instances("Metadata", atts, instances.size());
for (Instance inst : instances) {
    newDataset.add(inst);
}
newDataset.setClassIndex(0);
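Most of the data is loaded as 'numeric', but I need some features (the first and second) to be categorical ("nominal" in Weka terminology).
I tried to convert them to nominal with a filter: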
NumericToNominal nomFilter = new NumericToNominal();
nomFilter.setAttributeIndicesArray(new int[] { 0, 1 });
nomFilter.setInputFormat(newDataset);
newDataset = Filter.useFilter(newDataset, nomFilter);
This works, but surprisingly, when debugging the dataset, some of the data is missing!
Before applying the filter:
@attribute Attribute0 numeric
@attribute Attribute1 numeric
@attribute Attribute2 numeric
// Remaining 11 attributes omitted
@data
{0 1005,1 1,2 19,3 1123,4 7,5 25,6 0.66,7 49,8 2892.21,9 5.32,10 22.63,11 0.4,12 48.95,13 5.29}
After applying the filter:
@attribute Attribute0 {0,2,3,4,5,6,7,9,11,12,18,22,23,24,25,35,36,39,40,45,51,56,60,67,68,69,78,79,83,84,85,88,94,98,126,127,128,1001,1003,1004,1005,1007,1008,1009,1012,1013,1017,1018,1019,1022}
@attribute Attribute1 {1,2}
@attribute Attribute2 numeric
// Remaining 11 attributes omitted
@data
{0 1005,2 19,3 1123,4 7,5 25,6 0.66,7 49,8 2892.21,9 5.32,10 22.63,11 0.4,12 48.95,13 5.29}
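Why did I lose the value of the second attribute?
The feature is not lost; it is just not written out explicitly in your output, because the output is in sparse format. See the ARFF documentation: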
Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented.
Sparse ARFF files have the same header (i.e. @relation and @attribute tags) but the data section is different. Instead of representing each value in order, like this:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
the non-zero attributes are explicitly identified by attribute number and their value stated, like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is: <index> <space> <value> where index is the attribute index (starting from 0).
Note that the omitted values in a sparse instance are 0, they are not "missing" values! If a value is unknown, you must explicitly represent it with a question mark (?).
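To make the rule in the quoted passage concrete, here is a minimal plain-Java sketch of how a sparse row could be rendered. It is not Weka's own code: the class and method names (SparseRowSketch, toSparseRow, format) are hypothetical, and modeling a missing value as NaN is an assumption made for this illustration.

```java
public class SparseRowSketch {
    /** Formats a value without a trailing ".0" when it is a whole number. */
    static String format(double v) {
        return v == Math.rint(v) ? String.valueOf((long) v) : String.valueOf(v);
    }

    /**
     * Renders one instance in sparse ARFF style.
     * Zeros are skipped entirely; NaN (assumed here to mean "missing") is written as "?".
     */
    static String toSparseRow(double[] values) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                continue; // a stored 0 is implied by omission, not written
            }
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append(i).append(' ');
            sb.append(Double.isNaN(values[i]) ? "?" : format(values[i]));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(toSparseRow(new double[] {0, 5, 0, 7}));
        System.out.println(toSparseRow(new double[] {0, Double.NaN, 0.66}));
    }
}
```

The point of the sketch is that an omitted entry and a missing entry are different things: the first is a stored 0, the second is written out as a question mark.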
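The last sentence is especially important. Your Attribute1 has two possible values, 1 and 2. Since it is now nominal, the value 1 is stored as index 0, and values stored as 0 are omitted from the sparse output.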
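The effect on Attribute1 can be sketched in plain Java, under the assumption that a nominal value is stored internally as the position of its label in the declared label list. The names below (NominalIndexSketch, encode, sparseEntry) are hypothetical and only mimic the behavior described above; they are not part of Weka.

```java
public class NominalIndexSketch {
    /** Returns the internal code of a nominal label: its position in the declared label list. */
    static int encode(String[] labels, String label) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equals(label)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown label: " + label);
    }

    /** Sparse-style entry "<index> <label>", or an empty string when the stored code is 0. */
    static String sparseEntry(int attIndex, String[] labels, int code) {
        return code == 0 ? "" : attIndex + " " + labels[code];
    }

    public static void main(String[] args) {
        String[] labels = {"1", "2"}; // mirrors: @attribute Attribute1 {1,2}
        for (String label : labels) {
            int code = encode(labels, label);
            System.out.println("label " + label + " -> code " + code
                    + ", sparse entry: \"" + sparseEntry(1, labels, code) + "\""
                    + ", decoded: " + labels[code]);
        }
    }
}
```

Label 1 gets code 0 and therefore produces no sparse entry, yet decoding the stored code still yields 1. Attribute0 stays visible in the question's output for the same reason in reverse: 1005 is not the first label in its list, so its stored code is non-zero.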
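Again: this only affects how the data is represented when you print it to a file or to the screen. The actual contents of your dataset have not changed.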