HtmlUnit: scraping with XPath from a div

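I'm trying to scrape the Google Movies page, and I want each theater's name, address, and showtimes. As you can see on that page, each block of information sits in a div with the class theater, and inside that div are the theater's name, address, and showtimes.

So I used HtmlUnit to extract the list of theater divs: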

List<HtmlDivision> div =  (List<HtmlDivision>) page.getByXPath("//div[@class='theater']");

When I print the first item of the list, I get the expected result:

System.out.println(div.get(0).asText());

Regal Battery Park Stadium 11
102 North End Avenue, New York, NY
1:00 4:10 7:20 10:35pm

Now I want to split this information into name, address, and showtimes. The problem is that when I do this:

System.out.println("Theater " + div.get(0).getByXPath("//div[@class='name']/a/text()"));

the result is the name of every theater on the page:

Theater [Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, AMC Village 7, UA Court Street Stadium 12 & RPX, Cobble Hill Cinemas, AMC Loews 19th St. East 6, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Pavilion Cinema, AMC Village 7, UA Court Street Stadium 12 & RPX, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Frank Theatres - South Cove Stadium 12]

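How can getByXPath return every theater when I call it on an object that doesn't even contain that information?

You need to add a dot (.) at the start of the XPath to make it relative to the current context element, in this case the first div (div.get(0)). Otherwise the XPath ignores the context element and searches for matching elements starting from the document root: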

div.get(0).getByXPath(".//div[@class='name']/a/text()")
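
To see the difference without hitting the real page, here is a minimal sketch that reproduces the behaviour with the JDK's built-in javax.xml.xpath API instead of HtmlUnit. That swap is an assumption made so the example runs with no extra dependencies; a leading // is anchored at the document root in XPath generally, so the same rule applies. The markup, the theater names, and the class and method names (RelativeXPathDemo, parse, theaters, names) are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Hypothetical markup, loosely following the structure described in the question.
    static final String SAMPLE =
        "<html><body>"
      + "<div class='theater'><div class='name'><a>Theater A</a></div>"
      + "<div class='address'>1 First St</div></div>"
      + "<div class='theater'><div class='name'><a>Theater B</a></div>"
      + "<div class='address'>2 Second St</div></div>"
      + "</body></html>";

    // Parses a well-formed XML/XHTML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Selects every theater div in the document.
    static NodeList theaters(Document doc) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//div[@class='theater']", doc, XPathConstants.NODESET);
    }

    // Evaluates an expression against a context node and returns the matched node values.
    static List<String> names(Node context, String expression) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            values.add(hits.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Node first = theaters(parse(SAMPLE)).item(0);

        // Leading "//" starts at the document root, so the context node is ignored.
        System.out.println(names(first, "//div[@class='name']/a/text()"));

        // Leading ".//" searches only the descendants of the context node.
        System.out.println(names(first, ".//div[@class='name']/a/text()"));
    }
}
```

The first call lists both theater names even though it is evaluated on the first div, while the second call returns only the name inside that div. Looping over all theater divs and using ./ or .// expressions for the name, address, and showtimes should then give one record per theater.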