Why is the HTML shown in Chrome DevTools different from the HTML parsed by jsoup?

Question

I am trying to extract the creation date of an issue from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).

As you can see in this screenshot, the creation date is the text inside the time tag whose class is livestamp (e.g. <time class=livestamp ...> 'this text' </time>).

So I tried to parse it with the code below.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc;

        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // Without a document there is nothing to select from, so stop here.
            e.printStackTrace();
            return;
        }

        // Select every <time> element that has the "livestamp" class.
        Elements elements = doc.select("time.livestamp");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}

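I expected to get the creation date, but the actual output is # of elements : 0.

That told me something was wrong, so I tried dumping the entire HTML of that page with the code below.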

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc;

        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // Without a document there is nothing to print, so stop here.
            e.printStackTrace();
            return;
        }

        // Select every element in the parsed HTML document.
        Elements elements = doc.select("*");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e);
        }
    }
}

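I compared the HTML shown in Chrome DevTools with the HTML I parsed, line by line, and found that they are different.

Can you explain why this happens and give me some advice on how to extract the creation date?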

Answer 1

I suggest you first get the elements with the "time" tag and then use select to pick out the time tag with the "livestamp" class. Here is an example:

Elements timeTags = doc.select("time");
Element timeLivestamp = null;
for (Element tag : timeTags) {
    // Stop at the first match for the "livestamp" class.
    Element livestamp = tag.selectFirst(".livestamp");
    if (livestamp != null) {
        timeLivestamp = livestamp;
        break;
    }
}
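
Before changing the selector, it may be worth checking whether the string livestamp appears at all in the HTML the server returns. jsoup parses that response without executing JavaScript, whereas the DevTools Elements panel shows the DOM after scripts have run, so an element visible in DevTools can be absent from what jsoup sees. The following is a minimal sketch under that assumption. It uses Java 11's java.net.http.HttpClient instead of jsoup so that it has no dependencies, and the class name RawHtmlCheck, the helper countOccurrences, and the two inline HTML snippets are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RawHtmlCheck {

    // Counts non-overlapping occurrences of marker in text.
    static int countOccurrences(String text, String marker) {
        if (marker.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = text.indexOf(marker);
        while (from != -1) {
            count++;
            from = text.indexOf(marker, from + marker.length());
        }
        return count;
    }

    // Downloads the body as the server sends it; no scripts are executed.
    static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Live check: pass the page URL as the first argument.
            String html = fetchRawHtml(args[0]);
            System.out.println("livestamp occurrences: " + countOccurrences(html, "livestamp"));
            return;
        }

        // Hypothetical snippets: what a server might send vs. what a browser shows later.
        String served = "<div id=\"content\"></div><script src=\"app.js\"></script>";
        String rendered = "<div id=\"content\"><time class=\"livestamp\">3 days ago</time></div>";
        System.out.println("served:   " + countOccurrences(served, "livestamp"));
        System.out.println("rendered: " + countOccurrences(rendered, "livestamp"));
    }
}
```

Run it with the issue URL as the only argument to check the live page. If the count is 0, no selector can find the element in the fetched document, however the query is written.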

I don't know why, but when I pass more than one selector to Jsoup's .select() method (as you did with time.livestamp), I get odd output like this.
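
If the date is only added to the page by JavaScript, no jsoup selector will find it in the fetched HTML, so one alternative is to skip the HTML and ask Jira's REST API for the created field. The sketch below is an illustration under stated assumptions, not a verified recipe for this site: it assumes that https://issues.apache.org/jira/rest/api/2/issue/HADOOP-16381?fields=created is readable anonymously and returns JSON with a fields.created timestamp shaped like the made-up sample. The class CreatedFieldParser, the method parseCreated, and the sample JSON are hypothetical, and the regular expression is a shortcut where a real JSON parser would be more robust.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CreatedFieldParser {

    // Assumed timestamp layout, e.g. 2019-01-02T03:04:05.000+0000.
    private static final DateTimeFormatter JIRA_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // Matches "created": "<value>" and captures the value.
    private static final Pattern CREATED =
            Pattern.compile("\"created\"\\s*:\\s*\"([^\"]+)\"");

    // Returns the parsed timestamp, or empty if the JSON has no "created" field.
    static Optional<OffsetDateTime> parseCreated(String json) {
        Matcher matcher = CREATED.matcher(json);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(OffsetDateTime.parse(matcher.group(1), JIRA_TIME));
    }

    public static void main(String[] args) {
        // Made-up response body; a real one would come from the REST URL above.
        String sample =
                "{\"key\":\"DEMO-1\",\"fields\":{\"created\":\"2019-01-02T03:04:05.000+0000\"}}";
        parseCreated(sample).ifPresent(System.out::println);
    }
}
```

The response body could be downloaded with any HTTP client, including Jsoup.connect(url).ignoreContentType(true).execute().body(), and then handed to parseCreated.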

html

java

html-parsing

jsoup

google-chrome-devtools