Java Web Scraper project is returning null instead of normal links
I'm using Maven for the HtmlUnit dependency for the web scraper. The main problem is that my scraper returns null instead of links. I made an Item class to set and get the URL:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.List;

public class Scraper {
    private static final String searchUrl = "https://sfbay.craigslist.org/search/sss?query=iphone%208&sort=rel";

    public static void main(String[] args) throws IOException {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setUseInsecureSSL(true);
        HtmlPage page = client.getPage(searchUrl);
        List<HtmlElement> items = page.getByXPath("//li[@class='result-row']");
        for (HtmlElement htmlItem : items) {
            HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath("//a[@class='result-image gallery']")); //itemAnchor gets the anchor specified by class result-image gallery//
            Item item = new Item();
            String link = itemAnchor.getHrefAttribute(); //link is extracted and initialized in string//
            item.setUrl(link);
            System.out.println(item.getUrl()); //why don't you work//
        }
    }
}
Result:
Basically a column of null, one per line.
*Note: in this case, System.out.println(link) does return a link, but it repeats the same link on every printed line: just the one link 'https://sfbay.craigslist.org/sby/mob/d/san-jose-iphone-plus-256-gb-black/7482411084.html' all the way down.
I'm new to this cruel world. Any help is appreciated.
Edit: Just in case, I'm including the dependency here. The Item class code probably isn't needed, since it's just the set and get methods outlined by setUrl and getUrl.
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.60.0</version>
</dependency>
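For reference, a minimal Item class matching that description might look like this (a sketch; the original class wasn't posted, so the field name is assumed):

// Hypothetical reconstruction of the Item class described above:
// a single url field with the setUrl and getUrl accessors mentioned.
public class Item {
    private String url;

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}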
This works here. Two changes matter: the XPath inside the loop is relative (a[@class='result-image gallery'], with no leading //), so it searches within each result row rather than from the document root (an absolute //a[...] passed to getFirstByXPath matches the first such anchor on the whole page every time, which is why the same link kept repeating), and the anchor is null-checked before use. The WebClient is also opened in a try-with-resources block so it gets closed:
public static void main(String[] args) throws IOException {
    String url = "https://sfbay.craigslist.org/search/sss?query=iphone%208&sort=rel";
    try (final WebClient webClient = new WebClient()) {
        HtmlPage page = webClient.getPage(url);
        // webClient.waitForBackgroundJavaScript(10_000);
        List<HtmlElement> items = page.getByXPath("//li[@class='result-row']");
        for (HtmlElement htmlItem : items) {
            HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath("a[@class='result-image gallery']"));
            if (itemAnchor != null) {
                String link = itemAnchor.getHrefAttribute();
                System.out.println("-> " + link);
            }
        }
    }
}
This produces something like:
-> https://sfbay.craigslist.org/eby/pho/d/walnut-creek-original-new-defender/7470991009.html
-> https://sfbay.craigslist.org/eby/pho/d/walnut-creek-original-new-defender/7471913572.html
-> https://sfbay.craigslist.org/eby/pho/d/walnut-creek-original-new-defender/7471010388.html
....
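And for completeness, the same relative-XPath fix carried back into the original Item-based program (a sketch, assuming the minimal Item class above):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.util.List;

public class Scraper {
    private static final String searchUrl = "https://sfbay.craigslist.org/search/sss?query=iphone%208&sort=rel";

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the WebClient when done
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(false);
            client.getOptions().setCssEnabled(false);
            HtmlPage page = client.getPage(searchUrl);
            List<HtmlElement> items = page.getByXPath("//li[@class='result-row']");
            for (HtmlElement htmlItem : items) {
                // Relative XPath (no leading //): search within this row only.
                HtmlAnchor itemAnchor = htmlItem.getFirstByXPath("a[@class='result-image gallery']");
                if (itemAnchor != null) {
                    Item item = new Item();
                    item.setUrl(itemAnchor.getHrefAttribute());
                    System.out.println(item.getUrl()); // now a distinct link per row
                }
            }
        }
    }
}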