Is there any way to map Weka j48 decision tree output to RDF format?

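I want to create an ontology with Jena, based on the output of a Weka J48 decision tree. But this output has to be mapped to RDF format before it is given to Jena. Is there any way to do this mapping?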

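Edit 1:

Sample part of the J48 decision tree output before the mapping:

Sample part of the RDF that corresponds to the decision tree output:

These 2 screenshots are from this research paper (slide 4):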

Efficient Spam Email Filtering using Adaptive Ontology

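There is probably no built-in way to do this.

Disclaimer: I have never worked with Jena and RDF before. So this answer may be incomplete, or it may miss the point of the intended conversion.

But first, a short rant: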


<rant>

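The snippets published in the paper (namely, the output of the Weka classifier and the RDF) are incomplete and obviously inconsistent. The conversion process is not described at all. Instead, they only mention: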

The challenge we faced was mainly to make J48 classification outputs to RDF and gave it to Jena

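(sic!)

Now, they solved it somehow. They could have provided their conversion code in a public open-source repository. That would have allowed others to contribute improvements, and it would have increased the visibility and verifiability of their approach. Instead, they wasted their time, and the time of their readers, on screenshots of various websites that serve as page fillers, in a pitiful attempt to squeeze another publication out of their approach.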

</rant>


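The following is my best effort at providing some building blocks that may be necessary for the conversion. It has to be taken with a grain of salt, because I am not familiar with the underlying methods and libraries. But I hope it can nevertheless count as "helpful".

The Weka Classifier implementations usually do not expose the structures that they use internally, so the internal tree structure cannot be accessed directly. However, there is a method prefix() that returns a string representation of the tree.

The code below contains a very pragmatic (and thus, somewhat brittle...) method that parses this string and builds a tree structure containing the relevant information. This structure consists of TreeNode objects: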
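To make the shape of that string more concrete before looking at the actual parser, here is a small, self-contained sketch. It assumes that the prefix string wraps every node in square brackets and lists the conditions for the children before the child blocks. The sample string is written by hand from the Iris tree shown further down, it is not captured Weka output, and the class PrefixShapeDemo with its header and childBlocks methods is made up for this illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: splits a bracketed prefix string
// into the header of the root node and the blocks of its direct children.
class PrefixShapeDemo
{
    // Returns the text between the outermost '[' and the first nested '['
    // (or the last ']' if there is no nested block, as for a leaf)
    static String header(String prefix)
    {
        int open = prefix.indexOf('[');
        int nested = prefix.indexOf('[', open + 1);
        int end = (nested == -1) ? prefix.lastIndexOf(']') : nested;
        return prefix.substring(open + 1, end).trim();
    }

    // Returns the direct child blocks "[...]" of the outermost block,
    // found by counting the bracket nesting depth
    static List<String> childBlocks(String prefix)
    {
        List<String> result = new ArrayList<String>();
        int depth = 0;
        int start = -1;
        for (int i = 0; i < prefix.length(); i++)
        {
            char c = prefix.charAt(i);
            if (c == '[')
            {
                depth++;
                if (depth == 2)
                {
                    start = i;
                }
            }
            else if (c == ']')
            {
                if (depth == 2)
                {
                    result.add(prefix.substring(start, i + 1));
                }
                depth--;
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Hand-written sample in the assumed format, not real Weka output
        String sample = "[petalwidth: <= 0.6, > 0.6"
            + "[Iris-setosa (50.0)]"
            + "[petalwidth: <= 1.7, > 1.7"
            + "[Iris-versicolor (48.0/1.0)][Iris-virginica (46.0/1.0)]]]";
        System.out.println("Header: " + header(sample));
        for (String block : childBlocks(sample))
        {
            System.out.println("Child:  " + block);
        }
    }
}
```

An attribute name or class label that itself contains a square bracket would confuse this kind of bracket counting, which is one reason why the string-based approach is brittle.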

static class TreeNode
{
    String label;
    String attribute;
    String relation;
    String value;
    ...
}
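  • label is the class label. It is non-null only for leaf nodes. For the example from the paper, this would be "0" or "1", indicating whether an email is spam.

  • attribute is the attribute that the decision is based on. For the example from the paper, such an attribute could be word_freq_remove

  • relation and value are the strings that represent the decision criterion. For example, these may be "<=" and "0.08"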
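To show how these fields work together, the following sketch walks such a tree for a single instance. It is only an illustration under a few assumptions: the class TreeWalkDemo, its reduced TreeNode copy, and the classify, matches, leaf and sampleTree helpers are hypothetical and not part of the program below, and the instance is simply a map from attribute names to numeric values. The relation and value of a node describe the edge that leads from its parent to it, so the test is evaluated against the attribute of the parent:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a reduced copy of the TreeNode fields plus hypothetical
// helpers that are not part of the original program.
class TreeWalkDemo
{
    static class TreeNode
    {
        String label;      // class label, non-null only for leaf nodes
        String attribute;  // attribute tested at an inner node
        String relation;   // relation on the edge leading to this node
        String value;      // threshold on the edge leading to this node
        List<TreeNode> children = new ArrayList<TreeNode>();
    }

    static TreeNode leaf(String relation, String value, String label)
    {
        TreeNode node = new TreeNode();
        node.relation = relation;
        node.value = value;
        node.label = label;
        return node;
    }

    // A tiny tree with one decision, using the attribute name and
    // threshold that the text mentions as examples
    static TreeNode sampleTree()
    {
        TreeNode root = new TreeNode();
        root.attribute = "word_freq_remove";
        root.children.add(leaf("<=", "0.08", "0"));
        root.children.add(leaf(">", "0.08", "1"));
        return root;
    }

    // Checks whether the edge leading to the child accepts the value
    static boolean matches(TreeNode child, double actual)
    {
        double threshold = Double.parseDouble(child.value);
        switch (child.relation)
        {
            case "<":  return actual < threshold;
            case "<=": return actual <= threshold;
            case ">":  return actual > threshold;
            case ">=": return actual >= threshold;
            default:   return false;
        }
    }

    // Follows the matching edges down to a leaf and returns its label,
    // or null if a value is missing or no edge matches
    static String classify(TreeNode node, Map<String, Double> instance)
    {
        if (node.label != null)
        {
            return node.label;
        }
        Double actual = instance.get(node.attribute);
        if (actual == null)
        {
            return null;
        }
        for (TreeNode child : node.children)
        {
            if (matches(child, actual))
            {
                return classify(child, instance);
            }
        }
        return null;
    }

    public static void main(String[] args)
    {
        Map<String, Double> instance = new HashMap<String, Double>();
        instance.put("word_freq_remove", 0.5);
        System.out.println("Label: " + classify(sampleTree(), instance));
    }
}
```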

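After such a tree structure has been created, it can be converted into an Apache Jena Model instance. The code contains such a conversion method but, as I am not familiar with RDF, I am not sure whether it makes sense conceptually. Adjustments may be necessary to create the "desired" RDF structure from this tree structure. Naively, though, the output looks plausible.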

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class WekaClassifierToRdf
{
    public static void main(String[] args) throws Exception
    {
        String fileName = "./data/iris.arff";
        ArffLoader arffLoader = new ArffLoader();
        arffLoader.setSource(new FileInputStream(fileName));
        Instances instances = arffLoader.getDataSet();
        // The class attribute is the last column (index 4 for Iris)
        instances.setClassIndex(instances.numAttributes() - 1);
        //System.out.println(instances);

        J48 classifier = new J48();
        classifier.buildClassifier(instances);

        System.out.println(classifier);

        String prefixTreeString = classifier.prefix();
        TreeNode node = processPrefixTreeString(prefixTreeString);

        System.out.println("Tree:");
        System.out.println(node.createString());

        Model model = createModel(node);

        System.out.println("Model:");
        model.write(System.out, "RDF/XML-ABBREV");
    }

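    // Recursively parses the bracketed string returned by J48#prefix()
    // into TreeNode objects. Assumes exactly two branches per inner node.
    private static TreeNode processPrefixTreeString(String inputString)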
    {
        String string = inputString.replaceAll("\n", "");

        //System.out.println("Input is " + string);

        int open = string.indexOf("[");
        int close = string.lastIndexOf("]");
        String part = string.substring(open + 1, close);

        //System.out.println("Part " + part);

        int colon = part.indexOf(":");
        if (colon == -1)
        {
            TreeNode node = new TreeNode();

            int openAfterLabel = part.lastIndexOf("(");
            String label = part.substring(0, openAfterLabel).trim();
            node.label = label;
            return node;
        }

        String attributeName = part.substring(0, colon);

        //System.out.println("attributeName " + attributeName);

        int comma = part.indexOf(",", colon);

        int leftOpen = part.indexOf("[", comma);

        String leftCondition = part.substring(colon + 1, comma).trim();
        String rightCondition = part.substring(comma + 1, leftOpen).trim();

        int leftSpace = leftCondition.indexOf(" ");
        String leftRelation = leftCondition.substring(0, leftSpace).trim();
        String leftValue = leftCondition.substring(leftSpace + 1).trim();

        int rightSpace = rightCondition.indexOf(" ");
        String rightRelation = rightCondition.substring(0, rightSpace).trim();
        String rightValue = rightCondition.substring(rightSpace + 1).trim();

        //System.out.println("leftCondition " + leftCondition);
        //System.out.println("rightCondition " + rightCondition);

        int leftClose = findClosing(part, leftOpen + 1);
        String left = part.substring(leftOpen, leftClose + 1);

        //System.out.println("left " + left);

        int rightOpen = part.indexOf("[", leftClose);
        int rightClose = findClosing(part, rightOpen + 1);
        String right = part.substring(rightOpen, rightClose + 1);

        //System.out.println("right " + right);

        TreeNode leftNode = processPrefixTreeString(left);
        leftNode.relation = leftRelation;
        leftNode.value = leftValue;

        TreeNode rightNode = processPrefixTreeString(right);
        rightNode.relation = rightRelation;
        rightNode.value = rightValue;

        TreeNode result = new TreeNode();
        result.attribute = attributeName;
        result.children.add(leftNode);
        result.children.add(rightNode);
        return result;

    }

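    // Returns the index of the ']' that closes the block whose content
    // starts at startIndex (that is, just after its '['), or -1 if none.
    private static int findClosing(String string, int startIndex)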
    {
        int stack = 0;
        for (int i=startIndex; i<string.length(); i++)
        {
            char c = string.charAt(i);
            if (c == '[')
            {
                stack++;
            }
            if (c == ']')
            {
                if (stack == 0)
                {
                    return i;
                }
                stack--;
            }
        }
        return -1;
    }

    static class TreeNode
    {
        String label;
        String attribute;
        String relation;
        String value;
        List<TreeNode> children = new ArrayList<TreeNode>();

        String createString()
        {
            StringBuilder sb = new StringBuilder();
            createString("", sb);
            return sb.toString();
        }

        private void createString(String indent, StringBuilder sb)
        {
            if (children.isEmpty())
            {
                sb.append(indent + label);
            }
            sb.append("\n");
            for (TreeNode child : children)
            {
                sb.append(indent + "if " + attribute + " " + child.relation
                    + " " + child.value + ": ");
                child.createString(indent + "  ", sb);
            }
        }

        @Override
        public String toString()
        {
            return "TreeNode [label=" + label + ", attribute=" + attribute
                + ", relation=" + relation + ", value=" + value + "]";
        }
    }    

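    // Encodes the relation and value of the edge leading to the given
    // node as a property name, for example "lte_0.6"
    private static String createPropertyString(TreeNode node)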
    {
        if ("<".equals(node.relation))
        {
            return "lt_" + node.value;
        }
        if ("<=".equals(node.relation))
        {
            return "lte_" + node.value;
        }
        if (">".equals(node.relation))
        {
            return "gt_" + node.value;
        }
        if (">=".equals(node.relation))
        {
            return "gte_" + node.value;
        }
        System.err.println("Unknown relation: " + node.relation);
        return "UNKNOWN";
    }    

    static Model createModel(TreeNode node)
    {
        Model model = ModelFactory.createDefaultModel();

        String baseUri = "http://www.example.com/example#";
        model.createResource(baseUri);
        model.setNsPrefix("base", baseUri);
        populateModel(model, baseUri, node, node.attribute);
        return model;
    }

    private static void populateModel(Model model, String baseUri,
        TreeNode node, String resourceName)
    {
        //System.out.println("Populate with " + resourceName);

        for (TreeNode child : node.children)
        {
            if (child.label != null)
            {
                Resource resource =
                    model.createResource(baseUri + resourceName);
                String propertyString = createPropertyString(child);
                Property property =
                    model.createProperty(baseUri, propertyString);
                Statement statement = model.createLiteralStatement(resource,
                    property, child.label);
                model.add(statement);
            }
            else
            {
                Resource resource =
                    model.createResource(baseUri + resourceName);
                String propertyString = createPropertyString(child);
                Property property =
                    model.createProperty(baseUri, propertyString);

                String nextResourceName = resourceName + "_" + child.attribute;
                Resource childResource =
                    model.createResource(baseUri + nextResourceName);
                Statement statement =
                    model.createStatement(resource, property, childResource);
                model.add(statement);
            }
        }
        for (TreeNode child : node.children)
        {
            String nextResourceName = resourceName + "_" + child.attribute;
            populateModel(model, baseUri, child, nextResourceName);
        }
    }

}
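One gap in the program above: createPropertyString only covers the four numeric comparisons and falls back to "UNKNOWN" for anything else. As far as I know, J48 prints nominal splits with "=" (and "!=" for binary splits on nominal attributes), so a data set with nominal attributes would hit that fallback. The following is a hedged sketch of a possible extension. The class PropertyNameDemo and its propertyName method are hypothetical, and the "eq_" and "neq_" prefixes and the character replacement are my own choices, not something prescribed by Jena or RDF:

```java
// Hypothetical variant of createPropertyString that also covers "=" and
// "!=" and replaces characters that are awkward in property names.
class PropertyNameDemo
{
    static String propertyName(String relation, String value)
    {
        String prefix;
        switch (relation)
        {
            case "<":  prefix = "lt_";  break;
            case "<=": prefix = "lte_"; break;
            case ">":  prefix = "gt_";  break;
            case ">=": prefix = "gte_"; break;
            case "=":  prefix = "eq_";  break;
            case "!=": prefix = "neq_"; break;
            default:
                throw new IllegalArgumentException(
                    "Unknown relation: " + relation);
        }
        // Keep letters, digits, '.', '_' and '-'; map everything else
        // (for example spaces in nominal values) to '_'
        return prefix + value.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    public static void main(String[] args)
    {
        System.out.println(propertyName("<=", "0.6"));
        System.out.println(propertyName("=", "sunny day"));
    }
}
```

Note that this only addresses the property names. The parser above still expects exactly two branches per node, so a nominal split with three or more values would need further changes there.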

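The program parses the well-known Iris data set from an ARFF file, runs the J48 classifier, builds the tree structure, and generates and prints the RDF model. The output is shown here:

The classifier, as printed by Weka: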

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :     5

Size of the tree :     9

The string representation of the tree structure that is built internally:

Tree:

if petalwidth <= 0.6:   Iris-setosa
if petalwidth > 0.6: 
  if petalwidth <= 1.7: 
    if petallength <= 4.9:       Iris-versicolor
    if petallength > 4.9: 
      if petalwidth <= 1.5:         Iris-virginica
      if petalwidth > 1.5:         Iris-versicolor
  if petalwidth > 1.7:     Iris-virginica

The generated RDF model:

Model:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:base="http://www.example.com/example#">
  <rdf:Description rdf:about="http://www.example.com/example#petalwidth">
    <base:gt_0.6>
      <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth">
        <base:gt_1.7>Iris-virginica</base:gt_1.7>
        <base:lte_1.7>
          <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength">
            <base:gt_4.9>
              <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength_petalwidth">
                <base:gt_1.5>Iris-versicolor</base:gt_1.5>
                <base:lte_1.5>Iris-virginica</base:lte_1.5>
              </rdf:Description>
            </base:gt_4.9>
            <base:lte_4.9>Iris-versicolor</base:lte_4.9>
          </rdf:Description>
        </base:lte_1.7>
      </rdf:Description>
    </base:gt_0.6>
    <base:lte_0.6>Iris-setosa</base:lte_0.6>
  </rdf:Description>
</rdf:RDF>