Scrapy FormRequest for a complicated payload

On a website listing lawyers' registration details, I am trying to scrape the information through this 4-step flow, which requires two FormRequests:

  1. Open the link with the search box where a lawyer's name is submitted (image 1) ("ali" is passed as the name query)
  2. Reach the page listing the lawyers found (image 2) by sending the search request with its payload via FormRequest
  3. Reach each lawyer's detail page by clicking the magnifier button via a FormRequest (image 3) (this is where the error occurs)
  4. Parse each lawyer's data points on image 3

The problem: my first FormRequest works and I can reach the list of lawyers. Then I run into two problems:

  1. Problem 1: my for loop only works for the first lawyer found.
  2. Problem 2: the second FormRequest does not work.
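For reference, problem 1 is typically caused by running absolute `//...` XPath queries inside the row loop: they always match from the document root, so every iteration returns the first row's cells. Querying each row relative to itself fixes this. A minimal stdlib sketch over a hypothetical two-row table:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row results table, mirroring the real page's layout
html = """<table><tbody>
<tr><td><input value="111"/></td><td/><td>1001</td><td>Ali</td><td>Demir</td></tr>
<tr><td><input value="222"/></td><td/><td>1002</td><td>Ali</td><td>Kaya</td></tr>
</tbody></table>"""

rows = []
for row in ET.fromstring(html).iter("tr"):
    # Query relative to the current row; an absolute query against the
    # whole document would keep returning the first row's cells.
    lic_number = row.findall("td")[2].text
    value = row.find(".//input").attrib["value"]
    rows.append((lic_number, value))

print(rows)  # → [('1001', '111'), ('1002', '222')]
```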

My insight: inspecting the payload the second FormRequest needs for each requested lawyer, it adds the value numbers of all lawyers to the payload in one batch, plus the index number of the requested lawyer.

Should I really pass all the values with every request? How do I send the correct payload? In my code I tried to send only the specific lawyer's value and index as the payload, without success. What code should I use to get the details of every lawyer in the list?
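As for problem 2, the traceback in the log below shows why the second `FormRequest` fails: `FormRequest` URL-encodes `formdata` through `to_bytes`, which accepts only `str` or `bytes`, so the integer `counter` used as a key and value raises the `TypeError`. The stdlib `urlencode` illustrates the fix once everything is stringified (index and value below are hypothetical):

```python
from urllib.parse import urlencode

counter, value = 1, "22300955972"  # hypothetical index and lawyer value

# {counter: value} with an int key cannot be URL-encoded; stringify first:
payload = {str(counter): value, "detay": str(counter)}

print(urlencode(payload))  # → 1=22300955972&detay=1
```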

import scrapy
import json
import logging
from scrapy.utils.response import open_in_browser


class BarolevSpider(scrapy.Spider):
    name = 'barolev'
    allowed_domains = ['www.barobirlik.org.tr']
    start_urls = ['https://www.barobirlik.org.tr/levhaiframe/342']

    headers = {
            'authority': 'www.barobirlik.org.tr',
            'path': '/Islemler/LevhaIFrame',
            'scheme': 'https',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9',
            'cache-control': 'max-age=0',
            'content-type': 'application/x-www-form-urlencoded',
            'origin': 'https://www.barobirlik.org.tr',
            'referer': 'https://www.barobirlik.org.tr/levhaiframe/342',
            'sec-fetch-dest': 'iframe',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
                }
    url = 'https://www.barobirlik.org.tr/Islemler/LevhaIFrame'

    def parse(self, response):

        #search inquiry body
        body = {
            'ad': 'ali',
            'soyad': '',
            'baroId': '342',
            'tip': 'Avukat',
                }
        logging.log(logging.INFO, "Search page opened")
        search = scrapy.FormRequest(self.url, callback = self.list_parser, method = "POST", headers=self.headers, formdata = body)

        yield search

    def list_parser(self, response):

        #open_in_browser(response)
        logging.log(logging.INFO, "Lawyers list's page opened")

        data = response.xpath("//div/table/tbody")
        lawyer_values = response.xpath("//input[@value]/text()")

        for counter, lawyer in enumerate(data):
            lic_number= response.xpath("//tbody/tr/td[3]/text()").get() #lawyer's licence number
            name = response.xpath(".//tbody/tr/td[4]/text()").get()
            surname = response.xpath("//tbody/tr/td[5]/text()").get()
            value = response.xpath(".//tbody/tr/td/input/@value").get()
            payload = {
                        counter : value,
                        'detay' : counter
                        }


            yield scrapy.FormRequest(self.url, callback= self.lawyer_parser, method = "POST", headers =self.headers, formdata = payload )




    def lawyer_parser(self, response):
        open_in_browser(response)
        pass

The Scrapy log:

2021-12-27 02:19:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: baro2)
2021-12-27 02:19:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.4.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.9.9 (main, Nov 16 2021, 10:24:31) - [GCC 11.2.0], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Linux-5.14.0-kali4-amd64-x86_64-with-glibc2.32
2021-12-27 02:19:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-12-27 02:19:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'baro2',
 'NEWSPIDER_MODULE': 'baro2.spiders',
 'SPIDER_MODULES': ['baro2.spiders']}
2021-12-27 02:19:23 [scrapy.extensions.telnet] INFO: Telnet Password: 05273ea1c26c6378
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 02:19:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 02:19:23 [scrapy.core.engine] INFO: Spider opened
2021-12-27 02:19:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 02:19:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 02:19:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barobirlik.org.tr/levhaiframe/342> (referer: None)
2021-12-27 02:19:24 [root] INFO: Search page opened
2021-12-27 02:19:26 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.barobirlik.org.tr/levhaiframe/342> (referer: https://www.barobirlik.org.tr/levhaiframe/342)
2021-12-27 02:19:26 [root] INFO: Lawyers list's page opened
2021-12-27 02:19:26 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.barobirlik.org.tr/levhaiframe/342> (referer: https://www.barobirlik.org.tr/levhaiframe/342)
Traceback (most recent call last):
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "/home/draco/scraping/scrapyyy/baro2/baro2/spiders/barolev.py", line 61, in list_parser
    yield scrapy.FormRequest(self.url, callback= self.lawyer_parser, method = "POST", headers =self.headers, formdata = payload )
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 71, in _urlencode
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/http/request/form.py", line 71, in <listcomp>
    values = [(to_bytes(k, enc), to_bytes(v, enc))
  File "/home/draco/.local/lib/python3.9/site-packages/scrapy/utils/python.py", line 106, in to_bytes
    raise TypeError('to_bytes must receive a str or bytes '
TypeError: to_bytes must receive a str or bytes object, got int
2021-12-27 02:19:26 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-27 02:19:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 943,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 8193,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.788149,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 26, 22, 19, 26, 603568),
 'httpcompression/response_bytes': 77405,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'memusage/max': 58245120,
 'memusage/startup': 58245120,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2021, 12, 26, 22, 19, 23, 815419)}
2021-12-27 02:19:26 [scrapy.core.engine] INFO: Spider closed (finished)

Thanks in advance.

The site uses some kind of protection, so this code only works some of the time; once you are detected you will have to wait a while until their anti-bot measures clear, or switch to proxies:

Add this import:

from re import sub

The main code is as follows:

def list_parser(self, response):
    logging.log(logging.INFO, "Lawyers list's page opened")

    # Collect each lawyer's {button index: profile id} pair from the results table
    post_dict = []
    for x in response.xpath("//div[@class='table-responsive']/table//tbody/tr"):
        counter_val = x.xpath("./td[1]/button[@name='detay']/@value").get()
        profile_id = x.xpath(f"./td[1]/input[@name='{counter_val}']/@value").get()
        post_dict.append({counter_val: profile_id})

    # Build the full body once: the site expects every lawyer's value in each request
    post_str = "detay=1"
    for x in post_dict:
        _k = list(x.keys())[0]
        _v = list(x.values())[0]
        post_str += f"&{_k}={_v}"

    # Only the detay index changes between the per-lawyer requests
    counter = 0
    for x in post_dict:
        counter += 1
        new_post_str = sub(r"detay=(\d+)", f"detay={counter}", post_str)
        yield scrapy.Request(self.url, callback=self.lawyer_parser, method="POST", headers=self.headers, body=new_post_str)

def lawyer_parser(self, response):
    title = response.xpath("//title").get()
    print (title)
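The payload construction above can be sketched in isolation; the profile ids here are made up:

```python
from re import sub

# Hypothetical {index: profile id} pairs scraped from the results table
post_dict = [{"1": "22300955972"}, {"2": "34294103512"}]

# Build the full body once: the detay index plus every lawyer's value
post_str = "detay=1"
for entry in post_dict:
    for k, v in entry.items():
        post_str += f"&{k}={v}"

# Only the detay index changes between the per-lawyer requests
bodies = [sub(r"detay=(\d+)", f"detay={i}", post_str)
          for i in range(1, len(post_dict) + 1)]

print(bodies[1])  # → detay=2&1=22300955972&2=34294103512
```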

It works in Postman like this:

POST https://www.barobirlik.org.tr/Islemler/LevhaIFrame
content-type: application/x-www-form-urlencoded
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

detay=1&1=22300955972&2=34294103512&3=47650810900&4=31457011912&5=21989082740&6=13031762182&7=15098226362&8=29465540500&9=19705078726&10=52327454912&11=38077312146&12=15053874294&13=43504737980&14=15757111312&15=49528751810&16=22454512274&17=20735008140&18=14759958840&19=25331632218&20=23717241030&21=26188728710&22=22144254712&23=40831499468&24=45721682982&25=17524950834&26=20443640946&27=43375067858&28=25958522862&29=55642518444&30=15958205476&31=67798055186&32=53125619114&33=19700638734&34=47962570370&35=61324031392&36=62770087018&37=42020121822&38=51916778712&39=63445269098&40=20968442094&41=37150116442&42=19825577734&43=34690100306&44=61117020312&45=47005726450&46=32674457828&47=28318970648&48=46675921516&49=43891755192&50=63547258662&51=14786321440&52=49882225538&53=15026163486&54=71761039790&55=19769136608&56=13592195252&57=26261321722&58=67759140784&59=15443072422&60=14358051782&61=35387076706&62=31600743554&63=29132448610&64=50134148826&65=13741738194&66=46951136320&67=48070816756&68=40531941366&69=25712046412&70=50041757234&71=34792727562&72=31373015384&73=19936276632&74=11726671214&75=50752736252&76=10825141642&77=41269955922&78=22703604262&79=16439033762&80=38872658142&81=22313706936&82=20122397994&83=22807774748&84=25522741996&85=23408214992&86=12125473522&87=34900898696&88=51511402166&89=57145071426&90=27010985158&91=43045344210&92=15517773668&93=27065110338
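For completeness, the same raw POST can be rebuilt with the standard library alone; this sketch only constructs the request (it is not sent) and truncates the body to the first three pairs:

```python
from urllib.request import Request

# First pairs of the working Postman body (truncated for the sketch)
body = "detay=1&1=22300955972&2=34294103512&3=47650810900"

req = Request(
    "https://www.barobirlik.org.tr/Islemler/LevhaIFrame",
    data=body.encode(),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# Supplying data makes urllib issue a POST
print(req.get_method())  # → POST
```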