Targeting the correct HTML with htmlunit

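Overview

I'm working on a project that scrapes a local theater's website for the movies currently playing. My goal is to eventually embed this information (movie title, movie description, etc.) via JSON into an email sent every morning, so we know what's playing without having to visit their website or download their app.

Base URL for this project: https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres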

Problem

Using htmlunit, I have successfully extracted the movie titles from the base URL. However, these titles include the upcoming films, which are also provided in the base URL's HTML.

I need help targeting the correct HTML. My current code uses a list of HtmlElement:

 List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");

I then loop through that list to extract the titles:

String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemCopy']")).asText();
String titleOnly = title.substring(0, title.indexOf("\n"));
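One caveat with the snippet above: `indexOf("\n")` returns -1 when the text contains no line break, and `substring(0, -1)` then throws a `StringIndexOutOfBoundsException`. A minimal sketch of a safer first-line extraction follows; the class name `TitleText` and the helper `firstLine` are hypothetical names used only for illustration, and the sample strings are made up.

```java
public class TitleText {
    // Returns the text before the first line break, or the whole text if
    // there is no line break. Null input yields an empty string.
    static String firstLine(String text) {
        if (text == null) {
            return "";
        }
        int end = text.indexOf('\n');
        String line = (end >= 0) ? text.substring(0, end) : text;
        // trim() also removes a trailing '\r' left over from "\r\n" line endings.
        return line.trim();
    }

    public static void main(String[] args) {
        // Made-up sample input, not taken from the real page.
        System.out.println(firstLine("LIMBO\nsecond line")); // prints LIMBO
        System.out.println(firstLine("NOMADLAND"));          // prints NOMADLAND
    }
}
```

In the loop, `firstLine(title)` would take the place of the `substring` call.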

I've been inspecting the HTML and know that I need to target:

<section class="gridRow section content">
<div class="navTabs">
<div class="navTabItem active" data-tab-item="#showing">

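To accomplish this, I'm fairly sure I need to change my List<HtmlElement> to reflect this, but I just haven't been able to get it working. I tried the following, to no avail: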

 List<HtmlElement> itemList = page.getByXPath("//div[@class='navTabItem active']");

Expected output

{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (DUBBED)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}

Current output

{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (DUBBED)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
{"title":"DREAM HORSE"}
{"title":"FINAL ACCOUNT"}
{"title":"FINDING YOU"}
{"title":"THE DRY"}
{"title":"THE HUMAN FACTOR"}
{"title":"WRATH OF MAN"}

Code

SpectrumFilmItems.java

package org.example;

public class SpectrumFilmItems {
    private String title;


    public SpectrumFilmItems(String title) {
        super();
        this.title = title;
    }


    public String getTitle(){
        return title;
    }

    public void setTitle(String title){
        this.title = title;
    }
}

SpectrumScraper.java

package org.example;

import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SpectrumScraper
{
    public static void main( String[] args )
    {
        // GET request to obtain HTML content from the web server.
        String baseUrl = "https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres";
        WebClient client = new WebClient();
        client.setCssErrorHandler(new SilentCssErrorHandler());
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);
        try {
            HtmlPage page = client.getPage(baseUrl);

            List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");

            if(itemList.isEmpty()){
                System.out.println("No item found.");
            }else {
                for (HtmlElement htmlItem : itemList) {
                    String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemCopy']")).asText();
                    String titleOnly = title.substring(0, title.indexOf("\n"));


                    SpectrumFilmItems filmItem = new SpectrumFilmItems(titleOnly);

                    ObjectMapper mapper = new ObjectMapper();
                    String jsonString = mapper.writeValueAsString(filmItem);
                    System.out.println(jsonString);
                }
            }
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }
}

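The consistent difference between the current films and the unreleased films is the attributes data-film-session and data-film-exp. Only add an entry to the list if it has one or both of these attributes. This is untested and may not work, but it's a step in the right direction.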

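for (HtmlElement htmlItem : itemList) {
    // getAttribute returns an empty string when the attribute is missing or has no value.
    String dataFilmSession = htmlItem.getAttribute("data-film-session");
    String dataFilmExp = htmlItem.getAttribute("data-film-exp");

    // Skip entries that carry neither attribute (presumably the upcoming films).
    if (dataFilmSession.isEmpty() && dataFilmExp.isEmpty()) {
        continue;
    }
    // your original code
}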
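The same filter could probably be pushed into the XPath expression, so that the list only ever contains matching entries. The sketch below is an illustration under assumptions: it runs the expression with the JDK's own `javax.xml.xpath` against a made-up, well-formed fragment rather than the real page, and the class name `NowShowingFilter`, the sample markup, and the attribute values are all hypothetical. With htmlunit, the expression string would instead be passed to `page.getByXPath(...)`, likely combined with the existing class test.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NowShowingFilter {
    // Matches <li> elements where at least one of the two attributes is present
    // and non-empty. In XPath 1.0, comparing a missing attribute with != is false.
    static final String NOW_SHOWING =
            "//li[@data-film-session != '' or @data-film-exp != '']";

    // Made-up markup for illustration; the real page structure may differ.
    static final String SAMPLE =
            "<ul>"
            + "<li data-film-session=\"123\">LIMBO</li>"
            + "<li data-film-exp=\"456\">NOMADLAND</li>"
            + "<li>DREAM HORSE</li>"
            + "<li data-film-session=\"\">FINAL ACCOUNT</li>"
            + "</ul>";

    // Returns the trimmed text of every <li> selected by NOW_SHOWING.
    static List<String> titles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(NOW_SHOWING, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                result.add(items.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titles(SAMPLE)); // prints [LIMBO, NOMADLAND]
    }
}
```

Filtering in the expression keeps the loop body unchanged, but it is just as untested against the live site as the attribute check above.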