How to move from one page to another while scraping all the data of a searched query
package scraper;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {
    public static void main(String[] args) throws Exception {
        final Document document =
            Jsoup.connect("https://www.indeed.com.pk/jobs?q=java&l=")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000)
                .get();
        Elements rows = document.select("div.row.result");
        for (Element row : rows) {
            Elements innerDivs = row.select("div");
            String header = innerDivs.get(1).text();
            String content = innerDivs.get(2).text();
            System.out.println("header = " + header + " -> " + content);
        }
    }
}
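In this code I am scraping the jobs for the search query Java, but it only scrapes the current page (the search query link in the code). I want to scrape all the pages related to Java.
Please help.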
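You need to find the pagination div, which has the .pagination class, and then select the first inner link for the first page, the second inner link for the second page, and so on.
Here is an example of how you can do this. You will need to modify it to load the correct page: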
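Elements pages = document.select("div.pagination a");
for (Element page : pages) {
    // Load the next page: use the current link's absolute URL and call get()
    // so that a Document is returned rather than a Connection
    Document nextPage = Jsoup.connect(page.attr("abs:href")).get();
    ...
}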
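Pagination links are often relative (they start with `/`), and `Jsoup.connect` needs a full URL. Asking jsoup for `abs:href` instead of `href` resolves the link against the address of the page it came from. The sketch below shows roughly the same resolution step using only the standard library; the class name, the helper `toAbsolute`, and the sample `href` value are made up for illustration and are not taken from the actual site.

```java
import java.net.URI;

class AbsoluteLinkDemo {
    // Resolve a possibly relative href against the URL of the page it was found on.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String pageUrl = "https://www.indeed.com.pk/jobs?q=java&l=";
        // Hypothetical href value, for illustration only.
        String href = "/jobs?q=java&start=10";
        // prints "https://www.indeed.com.pk/jobs?q=java&start=10"
        System.out.println(toAbsolute(pageUrl, href));
    }
}
```

An href that is already absolute comes back unchanged, so the same call is safe for both kinds of link.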
Working example:
package scraper;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {
    public static void main(String[] args) throws Exception {
        final Document document =
            Jsoup.connect("https://www.indeed.com.pk/jobs?q=java&l=")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000)
                .get();
        scrape(document);

        // Move to the next page
        Element page = document.select("div.pagination a").get(1);
        System.out.println("Page link: " + page.attr("href"));
        Document pageDoc = Jsoup.connect(page.attr("abs:href")).get();
        scrape(pageDoc);
    }

    public static void scrape(Document document) {
        Elements rows = document.select("div.row.result");
        for (Element row : rows) {
            Elements innerDivs = row.select("div");
            String header = innerDivs.get(1).text();
            String content = innerDivs.get(2).text();
            System.out.println("header = " + header + " -> " + content);
        }
    }
}
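The example above follows a single pagination link. To cover every page, one possible approach is to keep a queue of page URLs still to load and a set of URLs already visited, since each page links back to the others. This is only a sketch of that loop, not tested against the real site: `PageWalker`, `walk`, the `linksOnPage` callback, and the `maxPages` safety limit are all hypothetical names, and an in-memory map stands in for the website so the loop runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class PageWalker {
    /**
     * Visits startUrl and every page reachable through pagination links,
     * each at most once, and returns the URLs in the order they were visited.
     * linksOnPage is a hypothetical callback that loads one page and returns
     * the absolute URLs of its pagination links.
     */
    static List<String> walk(String startUrl,
                             Function<String, List<String>> linksOnPage,
                             int maxPages) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(startUrl);
        while (!pending.isEmpty() && seen.size() < maxPages) {
            String url = pending.poll();
            if (!seen.add(url)) {
                continue; // already visited through another link
            }
            for (String link : linksOnPage.apply(url)) {
                if (!seen.contains(link)) {
                    pending.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // In-memory stand-in for a site with three result pages that link to each other.
        Map<String, List<String>> site = Map.of(
            "p1", List.of("p2", "p3"),
            "p2", List.of("p1", "p3"),
            "p3", List.of("p1", "p2"));
        // prints "[p1, p2, p3]"
        System.out.println(walk("p1", site::get, 100));
    }
}
```

In the scraper, the callback would load the page with `Jsoup.connect(url).get()`, pass the document to `scrape`, and return `document.select("div.pagination a").eachAttr("abs:href")`. Because the pagination bar usually shows only a few neighbouring pages, later pages are discovered as the walk proceeds, and `maxPages` keeps the loop bounded if the site has more pages than expected.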