DBpedia resource parsing in Java

Question

Using DBpedia Spotlight, I obtain DBpedia URIs. For example:

http://dbpedia.org/resource/Part-of-speech_tagging

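I need to request this URI from Java so that it returns some JSON/XML from which I can extract the information I need.

For example, for the URI mentioned above, I need the values of dct:subject.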

Below is a screenshot of the response I get in the browser.

Answer 1

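I'm not sure which values you are looking for, but you should be able to get what you want from the page source without any dependencies. The four Java methods provided below should cover your needs (one of them is a support method).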

Getting the Web Page HTML Source:

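First we use the getWebPageSource() method to get the HTML source of the web page. This method fetches the entire HTML source of the page located at the supplied link string. The source is returned in a List interface object (List<String>). Example usage: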

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);

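When this code is run, the pageSource list variable will contain all of the HTML source for the link string you supplied, in this case "http://dbpedia.org/resource/Part-of-speech_tagging". If you like, you can loop over the list and display it in your console window with System.out.println(), like this: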

for (int i = 0; i < pageSource.size(); i++) {
    System.out.println(pageSource.get(i));
}
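
One detail worth knowing when reading a page this way: the Content-Encoding header describes compression (gzip, for example), while the character set normally travels in the Content-Type header. The following is a minimal sketch, not part of the original methods, of picking the charset out of a Content-Type value and falling back to UTF-8. The class and method names (CharsetSniffer, charsetFromContentType) are hypothetical and exist only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {

    /**
     * Extracts the charset from a Content-Type header value such as
     * "text/html; charset=UTF-8". Falls back to UTF-8 when the header is
     * missing, has no charset parameter, or names an unusable charset.
     */
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException ex) {
                    // Covers both illegal and unsupported charset names.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFromContentType("application/json"));
    }
}
```

With a URLConnection named yc, the result could be passed straight to the reader, for example new InputStreamReader(stream, CharsetSniffer.charsetFromContentType(yc.getContentType())).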

Getting Related Links Using A Reference String:

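Now that you have the web page source, you can locate and extract the data you want. The next method is getRelatedLinks(). It retrieves all links found between specific supplied string tags, on source lines that contain the supplied reference string. In your case the reference string would be "rel=\"dct:subject\"", the start tag string "href=\"", and the end tag string "\">". So every page source line that contains the reference string "rel=\"dct:subject\"" is examined, and if the supplied start tag ("href=\"") and end tag ("\">") are both found on the same source line, the text between those tags is retrieved. Example usage: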

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");

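All links related to the reference string "rel=\"dct:subject\"" are now held in the String array variable named relatedLinksTo. If you iterate through the array and display its contents in the console window: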

// Display Related Links...
for (int i = 0; i < relatedLinksTo.length; i++) {
    System.out.println(relatedLinksTo[i]);
}

you will see:

http://dbpedia.org/resource/Category:Corpus_linguistics
http://dbpedia.org/resource/Category:Markov_models
http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing
http://dbpedia.org/resource/Category:Word-sense_disambiguation

If you only want the title each link relates to rather than the whole link string, you can do this:

// Display Related Links Titles...
for (int i = 0; i < relatedLinksTo.length; i++) {
    String rLink = relatedLinksTo[i].substring(relatedLinksTo[i].lastIndexOf(":") + 1);
    System.out.println(rLink);
}

What you will see in the console window is:

Corpus_linguistics
Markov_models
Tasks_of_natural_language_processing
Word-sense_disambiguation

This method uses the support method named getBetween(), provided below.
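
To make the marker-based extraction concrete, here is a minimal sketch of what happens to a single source line. The sample line is an assumption about how the page renders a dct:subject link; the real markup may differ (attribute order, for example), which is the main fragility of scraping HTML this way. The class and method names (BetweenTagsDemo, between) are illustrative only and are not the methods from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenTagsDemo {

    /** Returns every substring of line found between startTag and endTag (non-greedy). */
    static List<String> between(String line, String startTag, String endTag) {
        List<String> found = new ArrayList<>();
        // Pattern.quote() makes the tags literal, so characters such as '"' or '>' are safe.
        Pattern p = Pattern.compile(Pattern.quote(startTag) + "(.*?)" + Pattern.quote(endTag));
        Matcher m = p.matcher(line);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical source line, assumed for illustration only.
        String line = "<a class=\"uri\" rel=\"dct:subject\" "
                + "href=\"http://dbpedia.org/resource/Category:Markov_models\">"
                + "dbc:Markov_models</a>";
        // Only lines carrying the reference string are inspected.
        if (line.contains("rel=\"dct:subject\"")) {
            System.out.println(between(line, "href=\"", "\">"));
        }
    }
}
```

If the href attribute were followed by another attribute instead of the closing "\">", the end tag would not sit directly after the URL and the captured text would include the extra markup, so the tag strings have to be checked against the actual page source.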

Getting A Specific Link From A Related Link List:

You may not want the whole list of related links, only one or more links for a particular title, for example Tasks_of_natural_language_processing. To get those links you can use the getFromRelatedLinksThatContain() method. Here is how:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");
String[] desiredLinks = getFromRelatedLinksThatContain(relatedLinksTo, "Tasks_of_natural_language_processing");

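This method requires you to pass what was returned from the getRelatedLinks() method along with the title you want the link for (Tasks_of_natural_language_processing). The title must be actual text contained within a link, and the match is case-sensitive. If you now iterate through the desiredLinks array: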

for (int i = 0; i < desiredLinks.length; i++) {
    System.out.println(desiredLinks[i]);
}

you will see the following link string in the console window:

http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing

The TESTED Methods:

// Imports required by the methods below.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

/**
 * Returns a List ArrayList containing the page source for the supplied web
 * page link.<br><br>
 *
 * @param webLink (String) The URL address of the web page to process.<br>
 *
 * @return (List ArrayList) A List ArrayList containing the page source for
 *         the supplied web page link.
 */
public List<String> getWebPageSource(String webLink) {
    if (webLink.equals("")) {
        return null;
    }
    try {
        URL url = new URL(webLink);

        URLConnection yc;
        //If url is a SSL Endpoint (using a Secure Socket Layer such as https)...
        if (webLink.startsWith("https:")) {
            yc = new URL(webLink).openConnection();
            //send request for page data...
            yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
            yc.connect();
        }
        //and if not an SSL Endpoint (just http)...
        else {
            yc = url.openConnection();
        }

        InputStream inputStream = yc.getInputStream();
        InputStreamReader streamReader = null;
        String encoding = null;
        try {
            encoding = yc.getContentEncoding().toLowerCase();
        }
        catch (Exception ex) {
            // No Content-Encoding header was sent; encoding stays null.
        }
        if (null == encoding) {
            encoding = "UTF-8";
            streamReader = new InputStreamReader(yc.getInputStream(), encoding);
        }
        else {
            switch (encoding) {
                case "gzip":
                    // Is compressed using GZip: Wrap the reader
                    inputStream = new GZIPInputStream(inputStream);
                    streamReader = new InputStreamReader(inputStream);
                    break;
                //streamReader = new InputStreamReader(inputStream);
                case "utf-8":
                    encoding = "UTF-8";
                    streamReader = new InputStreamReader(yc.getInputStream(), encoding);
                    break;
                case "utf-16":
                    encoding = "UTF-16";
                    streamReader = new InputStreamReader(yc.getInputStream(), encoding);
                    break;
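                default:
                    // Unknown content encoding: fall back to UTF-8 so that
                    // streamReader is never left null.
                    streamReader = new InputStreamReader(inputStream, "UTF-8");
                    break;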
            }
        }

        List<String> sourceText;
        try (BufferedReader in = new BufferedReader(streamReader)) {
            String inputLine;
            sourceText = new ArrayList<>();
            while ((inputLine = in.readLine()) != null) {
                sourceText.add(inputLine);
            }
        }
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    return null;
}

/**
 * This method will retrieve all links which are contained between specifically 
 * supplied String Tags where the desired Links may reside between and are related 
 * to the supplied <b>Reference String</b>. A String Start Tag and a String End Tag 
 * would be required as well.<br><br>
 * 
 * So, if any Web Page Source line that contains the Reference String of:<pre>
 * 
 *     "rel=\"dct:subject\""</pre><br>
 * 
 * is looked at and if <i>on the same source line</i> the supplied Start Tag 
 * String (ie: "href=\"") and the supplied End Tag String (ie: "\">") are found then 
 * the text between those tags is retrieved.<br><br>
 * 
 * This method utilizes the support method named <b>getBetween()</b>.<br><br>
 * 
 * @param referenceString (String) The reference string to look for on any web 
 * page source line.<br>
 * 
 * @param pageSource (List Interface of String) The List which contains all the 
 * HTML Web Page Source.<br>
 * 
 * @param desiredLinkStartTag (String) The Start Tag String where the desired 
 * Link or links may reside after. This can be any string. Links are retrieved 
 * from between the Start Tag and the End Tag.<br>
 * 
 * @param desiredLinkEndTag (String) The End Tag String where the desired 
 * Link or links may reside before. This can be any string. Links are retrieved 
 * from between the Start Tag and the End Tag.<br>
 * 
 * @return (1D String Array) A String Array containing the Links Found.<br>
 * 
 * @see #getBetween(java.lang.String, java.lang.String, java.lang.String, boolean...) getBetween()
 */
public String[] getRelatedLinks(String referenceString, List<String> pageSource, 
        String desiredLinkStartTag, String desiredLinkEndTag) {
    List<String> links = new ArrayList<>();
    for (int i = 0; i < pageSource.size(); i++) {
        if (pageSource.get(i).contains(referenceString)) {
            String[] lnks = getBetween(pageSource.get(i), desiredLinkStartTag, desiredLinkEndTag);
            // getBetween() returns null when nothing is found between the tags.
            if (lnks != null) {
                links.addAll(Arrays.asList(lnks));
            }
        }
    }
    return links.toArray(new String[0]);
}

/**
 * Retrieves a specific Link from within the Related Links List generated by 
 * the <b>getRelatedLinks()</b> method.<br><br>
 * 
 * This method requires the use of the <b>getRelatedLinks()</b> method.
 * 
 * @param relatedArray (1D String Array) The array returned from the <b>getRelatedLinks()</b> 
 * method.<br>
 * 
 * @param desiredStringInLink (String - Letter Case Sensitive) The string title 
 * contained within the link to retrieve.<br>
 * 
 * @return (1D String Array) Containing any links found.<br>
 * 
 * @see #getRelatedLinks(java.lang.String, java.util.List, java.lang.String, java.lang.String) getRelatedLinks()
 * 
 */
public String[] getFromRelatedLinksThatContain(String[] relatedArray, String desiredStringInLink) {
    List<String> desiredLinks = new ArrayList<>();
    for (int i = 0; i < relatedArray.length; i++) {
        if (relatedArray[i].contains(desiredStringInLink)) {
            desiredLinks.add(relatedArray[i]);
        }
    }
    return desiredLinks.toArray(new String[0]);
}

/**
 * Retrieves any string data located between the supplied string leftString
 * parameter and the supplied string rightString parameter.<br><br>
 *
 * This method will return all instances of a substring located between the
 * supplied Left String and the supplied Right String which may be found
 * within the supplied Input String.<br>
 *
 * @param inputString (String) The string to look for substring(s) in.
 *
 * @param leftString  (String) What may be to the Left side of the substring
 *                    we want within the main input string. Sometimes the
 *                    substring you want may be contained at the very
 *                    beginning of a string and therefore there is no
 *                    Left-String available. In this case you would simply
 *                    pass a Null String ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param rightString (String) What may be to the Right side of the
 *                    substring we want within the main input string.
 *                    Sometimes the substring you want may be contained at
 *                    the very end of a string and therefore there is no
 *                    Right-String available. In this case you would simply
 *                    pass a Null String ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param options     (Optional - Boolean - 2 Parameters):<pre>
 *
 *      ignoreLetterCase    - Default is false. This option works against the
 *                            string supplied within the leftString parameter
 *                            and the string supplied within the rightString
 *                            parameter. If set to true then letter case is
 *                            ignored when searching for strings supplied in
 *                            these two parameters. If left at default false
 *                            then letter case is not ignored.
 *
 *      trimFound           - Default is true. By default this method will trim
 *                            off leading and trailing white-spaces from found
 *                            sub-string items. General sentences which obviously
 *                            contain spaces will almost always give you a white-
 *                            space within an extracted sub-string. By setting
 *                            this parameter to false, leading and trailing white-
 *                            spaces are not trimmed off before they are placed
 *                            into the returned Array.</pre>
 *
 * @return (1D String Array) Returns a Single Dimensional String Array
 *         containing all the sub-strings found within the supplied Input
 *         String which are between the supplied Left String and supplied
 *         Right String. You can shorten this method up a little by
 *         returning a List&lt;String&gt; ArrayList and removing the 'List
 *         to 1D Array' conversion code at the end of this method. This
 *         method initially stores its findings within a List object
 *         anyways.
 */
public static String[] getBetween(String inputString, String leftString, String rightString, boolean... options) {
    // Return nothing if nothing was supplied.
    if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
        return null;
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false; // Default.
    boolean trimFound = true;   // Default.
    if (options.length > 0) {
        if (options.length >= 1) {
            ignoreCase = options[0];
        }
        if (options.length >= 2) {
            trimFound = options[1];
        }
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List String Array Object to hold
    // our found substrings between the supplied Left
    // String and supplied Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = Pattern.quote(leftString)
            + (!rightString.equals("") ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    String[] res;
    // Convert the ArrayList to a 1D String Array.
    // If the List contains something then convert
    if (list.size() > 0) {
        res = new String[list.size()];
        res = list.toArray(res);
    } // Otherwise return Null.
    else {
        res = null;
    }
    // Return the String Array.
    return res;
}

Or... use SPARQL, or whatever other parser you need, for example one for JSON.

Answer 2

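Your question doesn't contain enough information about what you are trying to achieve, so it isn't possible to say which route is best. You might consider using the Jena or RDF4J/Sesame frameworks.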

Or you might simply ask the DBpedia endpoint for what you want, whether that is a complete description of <http://dbpedia.org/resource/Part-of-speech_tagging> in JSON (as linked from the Formats menu seen in your screenshot), or a SPARQL query URI that requests just the dct:subject values:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?subject
  WHERE { dbr:Part-of-speech_tagging dct:subject ?subject }
LIMIT 100

The results can be retrieved in various serializations, including JSON.
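
As a rough sketch of how such a query could be sent from Java without extra libraries, the example below builds a SPARQL protocol GET request and asks for JSON results through the Accept header. It assumes Java 11 or later (java.net.http) and that the public endpoint is reachable at https://dbpedia.org/sparql; the class and method names are hypothetical, and the resource URI is inserted into the query unescaped, so it should come from a trusted source.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class DbpediaSubjectQuery {

    // Assumed public endpoint; adjust if you use a mirror or a local store.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    /** Builds the SELECT query for the dct:subject values of one resource URI. */
    static String buildQuery(String resourceUri) {
        return "PREFIX dct: <http://purl.org/dc/terms/>\n"
                + "SELECT DISTINCT ?subject\n"
                + "WHERE { <" + resourceUri + "> dct:subject ?subject }\n"
                + "LIMIT 100";
    }

    /** Encodes the query as the "query" parameter of a SPARQL protocol GET request. */
    static URI buildRequestUri(String endpoint, String query) {
        return URI.create(endpoint + "?query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8));
    }

    /** Sends the request and returns the raw SPARQL JSON results document. */
    static String fetchJson(URI requestUri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(requestUri)
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String query = buildQuery("http://dbpedia.org/resource/Part-of-speech_tagging");
        URI uri = buildRequestUri(ENDPOINT, query);
        System.out.println(uri);
        // The network call only runs when "--fetch" is passed on the command line.
        if (args.length > 0 && args[0].equals("--fetch")) {
            System.out.println(fetchJson(uri));
        }
    }
}
```

In the SPARQL JSON results format, each row appears under results.bindings, and the category URI for this query would be in the value field of the subject binding. Any JSON parser, or the Jena and RDF4J frameworks mentioned above, can read it from there.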

