Java: HtmlUnit problem retrieving page title

Question

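This is my first Stack Overflow post, so I'll do my best to describe my problem.

I want to create a program that retrieves reviews from TripAdvisor pages. I tried to do it through the API, but they never responded when I requested an API key, so my alternative is to do it with a web crawler.

For this I have a Spring project and I'm using HtmlUnit, a tool I have never used before. To test it, my first exercise was to retrieve the title of a web page, so I implemented the following code: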

@PostConstruct
public void init() throws Exception {
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/Madrid"));

    getRequest.getPageName();

}

This calls the following method:

@Test
public void getPageName() throws Exception {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.tripadvisor.com");
        
        System.out.println(page.getTitleText());

    }
    catch (Exception e){
        System.out.println("ERROR " + e);
    }
}

When I run the code with https://www.google.com I get the response "Google" as expected, but when I try it with https://www.tripadvisor.com or https://www.youtube.com, I get an error that I can't understand:

Caused by: net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: syntax error (https://static.tacdn.com/assets/DDGchX.17d9b05f.js#1)

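I did some quick research to find out what the error means and found several posts about similar cases, but I still don't understand the cause. Is it related to a JavaScript issue? Or is it a permissions issue?

If more information is needed to analyze the problem, don't hesitate to ask. Thanks in advance to any reader for their time, and I apologize if I have not respected any Stack Overflow rules/protocols.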

Answer 1

    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        final HtmlPage page = webClient.getPage("https://www.tripadvisor.com");
        // final HtmlPage page = webClient.getPage("https://www.youtube.com");

        System.out.println("****************");
        System.out.println(page.getTitleText());
        System.out.println("****************");
    }
    catch (Exception e){
        System.out.println("ERROR " + e);
    }

At least with a recent version of HtmlUnit, this produces

****************
Tripadvisor: Read Reviews, Compare Prices & Book
****************

What does setThrowExceptionOnScriptError do?

/**
 * Changes the behavior of this webclient when a script error occurs.
 * @param enabled indicates if exception should be thrown or not
 */

HtmlUnit uses Rhino (https://github.com/mozilla/rhino) as the base for its JavaScript support, and Rhino does not support all the language features available in JavaScript today (it is getting better with every version, see https://htmlunit.sourceforge.io/changes-report.html). But at least some pages use these features (for example, to track you), and that is why you see the error. HtmlUnit was originally designed as a testing framework, so by default it stops on every error.

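If you change this (see the option setting above), you will still get log output for every error, but JavaScript processing continues (the same as in a real browser). You can also change the logging; see https://htmlunit.sourceforge.io/logging.html.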
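As a rough sketch of the logging change mentioned above: if your setup ends up routing HtmlUnit's Commons Logging output to java.util.logging (an assumption; a Spring project may instead route it to Logback or Log4j, in which case the level has to be set in that framework's configuration), the script error messages can be silenced by turning off the HtmlUnit logger. The package name `com.gargoylesoftware.htmlunit` is the one used by HtmlUnit 2.x, which the `net.sourceforge.htmlunit.corejs` stack trace in the question suggests; HtmlUnit 3.x uses `org.htmlunit` instead. The class name `QuietHtmlUnitLogging` is made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a level
    // set on an otherwise unreferenced logger can be lost to garbage collection.
    // Assumption: HtmlUnit 2.x package name; use "org.htmlunit" for HtmlUnit 3.x.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Turns off java.util.logging output for the HtmlUnit package and its children. */
    public static void silenceHtmlUnit() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silenceHtmlUnit();
        System.out.println(HTMLUNIT_LOGGER.getName() + " -> " + HTMLUNIT_LOGGER.getLevel());
    }
}
```

Call `QuietHtmlUnitLogging.silenceHtmlUnit()` once before creating the `WebClient`. This only hides the messages; the scripts that Rhino cannot parse still fail, so anything on the page that depends on them may be missing.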
