Targeting the correct HTML with htmlunit
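Overview
I'm working on a project that web-scrapes the films currently playing from a local theatre's website. My goal is to eventually embed this information (film title, film description, etc.), via JSON, into an email sent every morning, so we know what is playing without having to visit their website or download their app.
Base URL for this project: https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres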
Problem
Using htmlunit, I have successfully extracted the film titles from the base URL. However, these titles include the upcoming films, which are also present in the base URL's HTML.
I need help targeting the correct HTML. My current code uses a list of HtmlElement:
List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");
I then loop through that list to extract the title:
String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemCopy']")).asText();
String titleOnly = title.substring(0, title.indexOf("\n"));
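A side note on the extraction above: `indexOf("\n")` returns -1 when the text has no line break, and `substring(0, -1)` then throws `StringIndexOutOfBoundsException`. The following is a minimal sketch of a more defensive variant; the class name, the `firstLine` helper, and the sample strings are hypothetical and not part of the original code.

```java
public class FirstLineSketch {

    // Hypothetical helper: returns the text up to the first line break,
    // or the whole text when there is no line break at all.
    static String firstLine(String text) {
        int newline = text.indexOf('\n');
        String line = (newline < 0) ? text : text.substring(0, newline);
        // trim() also drops a trailing '\r' left over from "\r\n" line endings.
        return line.trim();
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the asText() result.
        System.out.println(firstLine("LIMBO\nsecond line of the film item"));
        System.out.println(firstLine("NOMADLAND"));
    }
}
```

In the scraper, `firstLine(title)` would stand in for the `substring` call, so an entry without a second line does not abort the loop.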
I have been inspecting the HTML and know that I need to target:
<section class="gridRow section content">
<div class="navTabs">
<div class="navTabItem active" data-tab-item="#showing">
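To accomplish this, I'm fairly sure I need to change my List<HtmlElement> to reflect this, but I haven't been able to get it to work. I tried the following, to no avail: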
List<HtmlElement> itemList = page.getByXPath("//div[@class='navTabItem active']");
Expected output
{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (DUBBED)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
Current output
{"title":"FOUR GOOD DAYS"}
{"title":"LIMBO"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (SUBTITLED)"}
{"title":"DEMON SLAYER THE MOVIE: MUGEN TRAIN (DUBBED)"}
{"title":"STREET GANG: HOW WE GOT TO SESAME STREET"}
{"title":"TOGETHER TOGETHER"}
{"title":"NOMADLAND"}
{"title":"THE TRUFFLE HUNTERS"}
{"title":"THE FATHER"}
{"title":"DREAM HORSE"}
{"title":"FINAL ACCOUNT"}
{"title":"FINDING YOU"}
{"title":"THE DRY"}
{"title":"THE HUMAN FACTOR"}
{"title":"WRATH OF MAN"}
Code
SpectrumFilmItems.java
package org.example;

public class SpectrumFilmItems {
    private String title;

    public SpectrumFilmItems(String title) {
        super();
        this.title = title;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
SpectrumScraper.java
package org.example;

import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class SpectrumScraper
{
    public static void main( String[] args )
    {
        // GET request to obtain HTML content from the web server.
        String baseUrl = "https://www.landmarktheatres.com/albany-ny/spectrum-8-theatres";
        WebClient client = new WebClient();
        client.setCssErrorHandler(new SilentCssErrorHandler());
        client.getOptions().setCssEnabled(false);
        client.getOptions().setJavaScriptEnabled(false);
        try {
            HtmlPage page = client.getPage(baseUrl);
            List<HtmlElement> itemList = page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4']");
            if (itemList.isEmpty()) {
                System.out.println("No item found.");
            } else {
                for (HtmlElement htmlItem : itemList) {
                    String title = ((HtmlElement) htmlItem.getFirstByXPath(".//div[@class='filmItemCopy']")).asText();
                    String titleOnly = title.substring(0, title.indexOf("\n"));
                    SpectrumFilmItems filmItem = new SpectrumFilmItems(titleOnly);
                    ObjectMapper mapper = new ObjectMapper();
                    String jsonString = mapper.writeValueAsString(filmItem);
                    System.out.println(jsonString);
                }
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
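The consistent difference between the films now showing and the unreleased films is the attributes data-film-session and data-film-exp. Only add an entry to the list if it has one or both of these attributes. This is untested and may not work, but it is a step in the right direction.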
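// Requires: import com.gargoylesoftware.htmlunit.html.DomElement;
for (HtmlElement htmlItem : itemList) {
    String dataFilmSession = htmlItem.getAttribute("data-film-session");
    // Skip entries whose data-film-session attribute is missing or empty.
    // Only data-film-session is checked here; data-film-exp could be tested the same way.
    if (dataFilmSession.equals(DomElement.ATTRIBUTE_NOT_DEFINED) || dataFilmSession.equals(DomElement.ATTRIBUTE_VALUE_EMPTY)) {
        continue;
    }
    // your original code
}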
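The same filter can probably be pushed into the XPath expression itself, so that upcoming films never enter the list. The sketch below illustrates the idea with the JDK's built-in XPath engine on a small hand-written fragment; the class name, the sample markup, and the assumption that the attributes sit on the li elements (as in the loop above) are all hypothetical, and it has not been run against the live page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FilmFilterSketch {

    // Hypothetical markup: two entries carry one of the attributes, one carries neither.
    static final String SAMPLE =
          "<ul>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4' data-film-session='1'>"
        + "<div class='filmItemCopy'>LIMBO</div></li>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4' data-film-exp='1'>"
        + "<div class='filmItemCopy'>NOMADLAND</div></li>"
        + "<li class='gridCol-s-12 gridCol-m-4 gridCol-l-4'>"
        + "<div class='filmItemCopy'>DREAM HORSE</div></li>"
        + "</ul>";

    // Keeps only li elements with a non-empty data-film-session or data-film-exp.
    static final String NOW_SHOWING =
        "//li[@data-film-session != '' or @data-film-exp != '']"
        + "//div[@class='filmItemCopy']";

    static List<String> nowShowingTitles(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            NOW_SHOWING, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            titles.add(nodes.item(i).getTextContent().trim());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(nowShowingTitles(SAMPLE)); // prints [LIMBO, NOMADLAND]
    }
}
```

With htmlunit, the li predicate could be added to the existing call, for example page.getByXPath("//li[@class='gridCol-s-12 gridCol-m-4 gridCol-l-4'][@data-film-session != '' or @data-film-exp != '']"), assuming the real page sets the attributes as described above.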