HTMLUnit not working with Ajax/JavaScript
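I'm trying to extract data from a web page (a page that displays search results) for a class project. Specifically, this page:
I only want to extract the product titles.
I'm using the following code: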
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
final HtmlPage page = webClient.getPage(itemPageURL);
int tries = 20; // Amount of tries to avoid infinite loop
while (tries > 0) {
    tries--;
    synchronized (page) {
        page.wait(2000); // How often to check
    }
}
int numThreads = webClient.waitForBackgroundJavaScript(1000000L);
PrintWriter pw = new PrintWriter("test-target-search.txt");
pw.println(page.asXml());
pw.close();
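The resulting page does not contain the product information that a web browser displays. I guess the AJAX calls haven't finished? (Not sure, though.)
Any help would be appreciated. Thanks!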
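You can use a plain GET request for this kind of task. Paging is controlled through the "pageCount" and "offset" parameters in the URL. After retrieving a page (the example below does this for a single page), you can extract the titles with a regular expression or anything else (a JSON parser?).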
public static void main(String[] args)
{
    try
    {
        WebClient webClient = new WebClient();
        URL url = new URL(
                "http://tws.target.com/searchservice/item/search_results/v1/by_keyword?callback=getPlpResponse&navigation=true&category=55krw&searchTerm=&view_type=medium&sort_by=bestselling&faceted_value=&offset=60&pageCount=60&response_group=Items&isLeaf=true&parent_category_id=55kug&custom_price=false&min_price=from&max_price=to");
        WebRequest requestSettings = new WebRequest(url, HttpMethod.GET);
        requestSettings.setAdditionalHeader("Accept", "*/*");
        requestSettings.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
        requestSettings.setAdditionalHeader("Referer", "http://www.target.com/c/xbox-one-games-video/-/N-55krw");
        requestSettings.setAdditionalHeader("Accept-Language", "en-US,en;q=0.8");
        requestSettings.setAdditionalHeader("Accept-Encoding", "gzip,deflate,sdch");
        requestSettings.setAdditionalHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.3");
        Page page = webClient.getPage(requestSettings);
        System.out.println(page.getWebResponse().getContentAsString());
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
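The answer leaves the extraction step open ("regex or anything else"), so here is a minimal sketch of the regex route. It assumes the response is JSONP (the URL passes callback=getPlpResponse) and that each item carries a string field named "title". The field name, the sample payload, and the class name JsonpTitleExtractor are all assumptions for illustration and are not taken from the real service. For anything beyond a quick experiment, a real JSON parser is likely the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of HtmlUnit.
class JsonpTitleExtractor {

    // Matches "title": "..." and captures the string value, allowing escaped characters.
    // ASSUMPTION: the field is called "title"; check the real response first.
    private static final Pattern TITLE =
            Pattern.compile("\"title\"\\s*:\\s*\"((?:\\\\.|[^\"\\\\])*)\"");

    /** Strips a JSONP wrapper such as callbackName({...}) and returns the inner text. */
    static String stripJsonp(String body) {
        String trimmed = body.trim();
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return trimmed; // already plain JSON, nothing to strip
        }
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close < open) {
            return trimmed; // no wrapper found
        }
        return trimmed.substring(open + 1, close);
    }

    /** Returns every "title" value found in the response body, in order of appearance. */
    static List<String> extractTitles(String body) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(stripJsonp(body));
        while (m.find()) {
            // Only escaped quotes are unescaped here; other JSON escapes are left as-is.
            titles.add(m.group(1).replace("\\\"", "\""));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up payload in the assumed shape, standing in for
        // page.getWebResponse().getContentAsString().
        String sample =
                "getPlpResponse({\"items\":[{\"title\":\"Game A\"},{\"title\":\"Game B\"}]})";
        System.out.println(extractTitles(sample)); // prints [Game A, Game B]
    }
}
```

In the answer's code, the string returned by page.getWebResponse().getContentAsString() would be passed to extractTitles. To cover more than one page, the same request could be repeated with a different "offset" value in the URL.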