Scrape HTML from websites that reload the page after a few seconds

I want to scrape HTML from sites such as http://www3.mangafreak.net/Manga/One_Piece using Jsoup and HtmlUnit. The problem with sites like this is that the first response is

Status Code:503 Service Temporarily Unavailable

and then, a few seconds later, the page reloads with

Status Code:200 OK

Try this (HtmlUnit only):

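    // HtmlUnit 2.x packages; in HtmlUnit 3.x and later the prefix is org.htmlunit
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.WebWindow;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    WebClient webClient = new WebClient();
    // Do not throw an exception on the initial 503 response
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

    // First response: the interim 503 page
    HtmlPage page = (HtmlPage) webClient.getPage("http://www3.mangafreak.net/Manga/One_Piece");
    System.out.println(page.asXml());

    // Wait for JavaScript jobs scheduled to start within the next 5 seconds (the reload)
    WebWindow window = page.getEnclosingWindow();
    window.getJobManager().waitForJobsStartingBefore(5000);

    // The window should now hold the reloaded page
    page = (HtmlPage) window.getEnclosedPage();
    System.out.println(page.asXml());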

Now that you have the page, you can use the HtmlUnit API to work with the DOM tree or click on elements.
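
If a fixed 5-second wait turns out to be too short or too long for a given site, one possible alternative is to poll until the status code becomes 200. The sketch below is only an illustration of that idea, not part of the answer above: the class `RetryUntilOk` and the method `waitForOk` are hypothetical names, and the status source is passed in as an `IntSupplier` so the example runs without HtmlUnit. With HtmlUnit, the supplier could presumably read `window.getEnclosedPage().getWebResponse().getStatusCode()` after each wait.

```java
import java.util.function.IntSupplier;

// Hypothetical helper (not part of HtmlUnit or Jsoup): polls a status source until it reports 200.
class RetryUntilOk {

    /**
     * Calls fetchStatus up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns the 1-based attempt number on which 200 was seen, or -1 if it never was
     * (or if the thread was interrupted while sleeping).
     */
    public static int waitForOk(IntSupplier fetchStatus, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (fetchStatus.getAsInt() == 200) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated site: answers 503 twice, then 200 (assumed behaviour, for illustration only)
        int[] responses = {503, 503, 200};
        int[] next = {0};
        IntSupplier fakeSite = () -> responses[Math.min(next[0]++, responses.length - 1)];

        int attempts = waitForOk(fakeSite, 5, 10L);
        System.out.println("200 OK after attempt " + attempts);
    }
}
```

Passing the status source in as a parameter keeps the retry logic independent of the HTTP library, so the same loop can be exercised with a fake source, as in `main`, before it is wired to a real page.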