How to move from one page to another while scraping all the data of a searched query

Question

     package scraper;

     import org.jsoup.Jsoup;
     import org.jsoup.nodes.Document;
     import org.jsoup.nodes.Element;
     import org.jsoup.select.Elements;

     public class Scraper {

         public static void main(String[] args) throws Exception {
             final Document document = Jsoup.connect("https://www.indeed.com.pk/jobs?q=java&l=")
                     .userAgent("Mozilla")
                     .cookie("auth", "token")
                     .timeout(3000)
                     .get();

             Elements rows = document.select("div.row.result");

             for (Element row : rows) {
                 Elements innerDivs = row.select("div");
                 String header = innerDivs.get(1).text();
                 String content = innerDivs.get(2).text();
                 System.out.println("header = " + header + " -> " + content);
             }
         }
     }

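In this code I am scraping the jobs for the search query "java", but it only scrapes the current page (the search query link in the code). I want to scrape all the pages related to Java.

Please help.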

Answer 1

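You need to find the pagination div, which has the .pagination class, then select its first inner link for the first page, its second inner link for the second page, and so on.

Here is an example of how to do this. You will need to modify it to load the correct pages: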

Elements pages = document.select("div.pagination a");
for (Element page : pages) {
    // Load the linked page; "abs:href" resolves a relative href against the document's base URI
    Document nextPage = Jsoup.connect(page.attr("abs:href")).get();
    ...
}

Working example:

package scraper;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {

    public static void main(String[] args) throws Exception {
        final Document document = 
                Jsoup.connect("https://www.indeed.com.pk/jobs?q=java&l=")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000)
                .get();
        scrape(document);

        // Move to the next page
        Element page = document.select("div.pagination a").get(1);
        System.out.println("Page link: " + page.attr("href"));
        Document pageDoc = Jsoup.connect(page.attr("abs:href")).get();
        scrape(pageDoc);
    }

    public static void scrape(Document document) {
        Elements rows = document.select("div.row.result") ;

        for (Element row : rows) {
            Elements innerDivs = row.select("div");
            String header = innerDivs.get(1).text();
            String content = innerDivs.get(2).text();
            System.out.println("header = "+header+ " -> "+content);
        }
    }
}
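
The working example loads only the second page. To cover every page, one option is to keep following pagination links until no new ones turn up. The sketch below shows just that traversal in plain Java. It assumes the pagination bar can repeat links (for example a "Next" link pointing at a page that is also listed by number) and may show only a window of pages, so it remembers which URLs it has already queued. `PageWalker`, `walk`, `visitPage` and `maxPages` are hypothetical names, not part of jsoup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class PageWalker {

    /**
     * Visits startUrl and every pagination URL discovered from it, each at most once.
     *
     * @param startUrl  URL of the first results page
     * @param visitPage hypothetical callback: scrapes one page and returns the
     *                  absolute pagination links found on that page
     * @param maxPages  safety cap on the number of pages visited
     * @return the URLs that were visited, in visiting order
     */
    public static List<String> walk(String startUrl,
                                    Function<String, List<String>> visitPage,
                                    int maxPages) {
        Set<String> seen = new HashSet<>();        // URLs already queued or visited
        Deque<String> queue = new ArrayDeque<>();  // URLs waiting to be visited
        List<String> visited = new ArrayList<>();

        seen.add(startUrl);
        queue.add(startUrl);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            visited.add(url);
            for (String link : visitPage.apply(url)) {
                // Skip empty hrefs and links that were already queued
                if (link != null && !link.isEmpty() && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

With jsoup, the callback could load the page with `Jsoup.connect(url).get()`, pass the document to `scrape(...)`, and return `document.select("div.pagination a").eachAttr("abs:href")`. Since `get()` throws a checked `IOException`, the lambda would need its own try/catch. The `maxPages` cap is there so a mistake in the selector cannot turn into an endless crawl.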

java

screen-scraping

web-scraping

jsoup