How to get to the next page while scraping if the link stays the same?

Question

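I've recently been learning web scraping, and I'm stuck. I need to scrape data from the next page, but there is only a clickable button and the link stays the same. So my question is: how can I get to the next page if the URL doesn't change? The site I'm scraping is http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp

My code so far: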

import scrapy
import json

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp/']

    def start_requests(self):
        # send a POST request to the site
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                   formdata={'sch_com_nm': '',
                                             'sch_yy': '2021',
                                             'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                             'code': '02/02020000/esg02020000',
                                             'pageFirstCall': 'Y'},
                                   callback=self.parse)]

    def parse(self, response):
        dict_data = json.loads(response.text)

        # loop over the results and read the company name and share ID
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            print(company_name, company_share_id)

So far I only get the data from the first page. Now I need to move on to the next page. Can someone explain how to do that?

Answer 1

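I find it easier to use scrapy_splash for JavaScript-heavy sites such as the one you are working with, because they usually take a while to load after a request is sent. So I wrote a simple Lua script that loads the site, and then parsed the information you need.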

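You will notice that the payload includes the page you are currently on; by incrementing this number up to the last page of the site, you can scrape each following page.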
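As a rough sketch of what "incrementing the page number" means for the request body, the snippet below builds the form fields for a given page and URL-encodes them with the standard library. The field names are the ones from the request the site sends; `form_data` is just an illustrative helper name, not part of Scrapy.

```python
from urllib.parse import urlencode


def form_data(page: int) -> dict:
    """Form fields for one page; only 'curPage' changes between pages."""
    return {
        'sch_com_nm': '',
        'sch_yy': '2021',
        'pagePath': '/contents/02/02020000/ESG02020000.jsp',
        'code': '02/02020000/esg02020000',
        'curPage': str(page),  # form values are sent as strings
    }


# URL-encoded body for page 3, as it would appear in a POST request.
encoded = urlencode(form_data(3))
print(encoded)
```

The dictionary can be passed as `formdata` to `scrapy.FormRequest`, which does the encoding itself; the encoded string is only shown here to make the changing field visible.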

Sites like this will block you quickly, so it is important to add timers and download delays to avoid being blocked.

Here is a working scraper:

import scrapy
from scrapy_splash import SplashRequest
import json

script = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(7))
  return splash:html()
end
"""
class KorenSiteSpider(scrapy.Spider):
    name = 'k-site'
    start_urls = ['https://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
        'DOWNLOAD_DELAY':3
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url = url,
                callback = self.parse, 
                endpoint='execute',
                args = {'lua_source':script}
            )

    def parse(self, response):
        for i in range(1, 78, 1):
            yield scrapy.FormRequest(
                url = 'https://esg.krx.co.kr/contents/99/ESG99000001.jspx',
                method = 'POST',
                formdata = {
                            'sch_com_nm': '',
                            'sch_yy': '2021',
                            'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                            'code': '02/02020000/esg02020000',
                            'curPage': str(i)
                            },
                callback = self.parse_json
            )

    def parse_json(self, response):
        dict_data = json.loads(response.text)

        # loop over the results and read the company name and share ID
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            yield {
                'company:name':company_name,
                'company_share_id':company_share_id
            }

Output:

2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '페이퍼코리아', 'company_share_id': '001020'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '평화산업', 'company_share_id': '090080'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '평화홀딩스', 'company_share_id': '010770'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '포스코', 'company_share_id': '005490'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '포스코강판', 'company_share_id': '058430'}
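If the fixed delay is not enough, Scrapy's built-in throttling settings can be combined with it. The dictionary below is only a sketch: the setting names are standard Scrapy settings, but the values and the name `POLITE_SETTINGS` are illustrative and would need tuning for this site.

```python
# Illustrative values only; tune them for the site being scraped.
POLITE_SETTINGS = {
    'DOWNLOAD_DELAY': 3,                  # base delay in seconds between requests
    'RANDOMIZE_DOWNLOAD_DELAY': True,     # vary the delay around the base value
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request at a time per domain
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to server latency
    'AUTOTHROTTLE_START_DELAY': 3,        # initial delay used by AutoThrottle
    'AUTOTHROTTLE_MAX_DELAY': 30,         # upper bound when the server is slow
}
```

In a spider this would be assigned to `custom_settings` (merged with the `USER_AGENT` entry), or the same keys could go into the project's settings.py.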

Answer 2

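The site you are scraping exposes an API that you can call directly instead of using Splash. If you check the network tab, you will see a POST request being sent to the server.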

See the sample code below. I have hard-coded the total number of pages, but you could find a way to obtain the total automatically instead.

Note the use of response.follow: the request goes through Scrapy's usual middlewares, so cookies and default headers are handled for you.

import scrapy

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        "USER_AGENT": 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
    }

    def parse(self, response):
        #send a post request to the api
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
        
        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }

        total_pages = 77
        for page in range(total_pages):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=%2Fcontents%2F02%2F02020000%2FESG02020000.jsp&code=02%2F02020000%2Fesg02020000&curPage={page+1}"
            yield response.follow(url=url, method='POST', callback=self.parse_result, headers=headers, body=payload)

    def parse_result(self, response):

        # loop over the results and yield the company name and share ID
        for item in response.json().get('result'):
            yield {
                'company_name': item.get('com_abbrv'),
                'company_share_id': item.get('isu_cd')
            }
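One possible way to avoid a hard-coded page total is to request pages one at a time and stop at the first page whose `result` list comes back empty. The sketch below shows that loop in plain Python, with the HTTP call hidden behind a `fetch_page` callable so it can be tried offline. `iter_companies`, `fetch_page`, `fake_fetch` and the `max_pages` safety cap are illustrative names, and the sketch assumes the API returns an empty `result` past the last page, which has not been verified for this site.

```python
from typing import Callable, Iterator


def iter_companies(fetch_page: Callable[[int], dict],
                   max_pages: int = 1000) -> Iterator[dict]:
    """Yield rows page by page until a page has no results.

    fetch_page(n) is a hypothetical callable returning the decoded JSON
    for page n (1-based), shaped like {"result": [...]}.
    """
    for page in range(1, max_pages + 1):
        rows = fetch_page(page).get('result') or []
        if not rows:
            # Assumption: an empty 'result' marks the end of the data.
            return
        for row in rows:
            yield {
                'company_name': row.get('com_abbrv'),
                'company_share_id': row.get('isu_cd'),
            }


# Stand-in for the real POST request (made-up rows, two pages only).
FAKE_PAGES = {
    1: {'result': [{'com_abbrv': 'A', 'isu_cd': '000001'}]},
    2: {'result': [{'com_abbrv': 'B', 'isu_cd': '000002'}]},
}


def fake_fetch(page: int) -> dict:
    return FAKE_PAGES.get(page, {'result': []})


companies = list(iter_companies(fake_fetch))
print(companies)
```

In a Scrapy spider the same idea would mean yielding the request for the next page from the callback of the current page, and only when the current page returned rows. That makes the crawl sequential, which is slower than firing all requests at once but needs no page count.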

scrapy

web-scraping