New line character handling in Jsoup

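When parsing HTML with JSoup, if a string of text contains a newline character, JSoup treats it as if it were not there. Consider the text "This string of text will wrap", followed by a newline, followed by "here because of a new line character". When JSoup parses this string, it returns "This string of text will wraphere because of a new line character". Note that the newline does not even become a space. I just want it returned as a space. This is text within a node. I have seen other solutions on Stack Overflow where people want, or do not want, a line break after a tag. That is not what I am after. I simply want to know whether I can modify the parse function to return newline characters instead of ignoring them.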

Can you try this? It gets the whole text, based on the answer here: Prevent Jsoup from discarding extra whitespace

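/**
 * Returns the text of the cell with its original whitespace preserved.
 *
 * @param cell element that contains whitespace formatting
 * @return the first child's whole text if it is a TextNode, otherwise cell.text()
 */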
public static String getText(Element cell) {
    String text = null;
    List<Node> childNodes = cell.childNodes();
    if (childNodes.size() > 0) {
        Node childNode = childNodes.get(0);
        if (childNode instanceof TextNode) {
            text = ((TextNode)childNode).getWholeText();
        }
    }
    if (text == null) {
        text = cell.text();
    }
    return text;
}

I figured it out. I made a mistake when fetching the HTML from the URL. I was using this method:

public static String getUrl(String url) {
    URL urlObj = null;
    try{
        urlObj = new URL(url);
    }
    catch(MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try{
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while((line = in.readLine()) != null){
            outputText += line;
        }
        in.close();
    }
    catch(IOException e){
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}

when I should have been using the following:

public static String getUrl(String url) {
    URL urlObj = null;
    try{
        urlObj = new URL(url);
    }
    catch(MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try{
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while((line = in.readLine()) != null){
            outputText += line + "\n";
        }
        in.close();
    }
    catch(IOException e){
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}

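The problem had nothing to do with JSoup. I thought I would note it here because I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial may run into the same problem.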
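For anyone who wants to see the cause in isolation: BufferedReader.readLine() returns each line without its line terminator, so concatenating the lines directly glues the last word of one line to the first word of the next. Below is a minimal sketch that reproduces this with a StringReader instead of a real URL connection. The class name ReadLineDemo and both method names are made up for illustration and are not part of the book's code or of JSoup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLineDemo {

    // Mirrors the loop in the first getUrl: readLine() strips the line
    // terminator, so appending the lines as-is loses every line break.
    public static String joinDroppingNewlines(Reader source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Mirrors the corrected loop: put a newline back after every line.
    // Note that this also appends one newline after the last line.
    public static String joinKeepingNewlines(Reader source) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>This string of text will wrap\nhere because of a new line character</p>";
        // The first call prints the text with "wraphere" run together.
        System.out.println(joinDroppingNewlines(new StringReader(html)));
        // The second call prints the text on two lines, as in the source.
        System.out.println(joinKeepingNewlines(new StringReader(html)));
    }
}
```

With the newline restored before the HTML reaches the parser, JSoup's usual whitespace normalization in text() should then collapse it to a single space, which is what the question asked for. The StringBuilder and try-with-resources are my own substitutions for the string concatenation and manual close() in the original method; the behavior being demonstrated is the same.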