Enter a code, click a button, and extract the result with Scrapy

I should say up front that I have never used Scrapy (so I don't even know whether it is the right tool).

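On the site https://www.ufficiocamerale.it/ I want to enter an 11-digit code (for example 06655971007) in the "INSERISCI LA PARTITA IVA/RAGIONE SOCIALE" field and then click "CERCA". I then want to store the resulting HTML in a variable that I will later parse with BeautifulSoup (that part should not be a problem). So how do I do the first part?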

I imagine something like this:

import scrapy

class Extraction(scrapy.Spider):

    def start_requests(self):
        url = "https://www.ufficiocamerale.it/"
        # To enter data
        yield scrapy.FormRequest(url=url, formdata={...}, callback=self.parse)
        # To click the button
        # some code

    def parse(self, response):
        print(response.body)

This is the HTML of the search field and the button:

<input type="search" name="search_input" class="autocomplete form-control" onchange="if (!window.__cfRLUnblockHandlers) return false; checkPartitaIva()" onkeyup="if (!window.__cfRLUnblockHandlers) return false; checkPartitaIva()" id="search_input" placeholder=" " value="">

<button onclick="if (!window.__cfRLUnblockHandlers) return false; dataLayer.push({'event': 'trova azienda'});" type="submit" class="btn btn-primary btn-sm text-uppercase">Cerca</button>

The page generates some of its elements with JavaScript, so it would be simpler to use Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = 'https://www.ufficiocamerale.it/'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(5)  # give the page's JavaScript time to load

# The find_element_by_* helpers were deprecated in Selenium 4 and later removed;
# find_element(By.<strategy>, value) is the current form.
item = driver.find_element(By.XPATH, '//form[@id="formRicercaAzienda"]//input[@id="search_input"]')
#item = driver.find_element(By.ID, 'search_input')
item.send_keys('06655971007')

time.sleep(1)

button = driver.find_element(By.XPATH, '//form[@id="formRicercaAzienda"]//p//button[@type="submit"]')
button.click()

time.sleep(5)  # give the results page time to render

html = driver.page_source  # rendered HTML as a string, ready to pass to BeautifulSoup

item = driver.find_element(By.TAG_NAME, 'h1')
print(item.text)
print('---')

# Each <li> in the first group holds one line of company data
all_items = driver.find_elements(By.XPATH, '//ul[@id="first-group"]/li')
for item in all_items:
    if '@' in item.text:
        print(item.text, '<<< found email:', item.text.split(' ')[1])
    else:
        print(item.text)
print('---')

Result:

DATI DELLA SOCIETÀ - ENEL ENERGIA S.P.A.
---
Partita IVA: 06655971007 - Codice Fiscale: 06655971007
Rag. Sociale: ENEL ENERGIA S.P.A.
Indirizzo: VIALE REGINA MARGHERITA 125 - 00198 - ROMA
Rea: 1150724
PEC: enelenergia@pec.enel.it <<< found email: enelenergia@pec.enel.it
Fatturato: € 13.032.695.000,00 (2020)
ACQUISTA BILANCIO
Dipendenti : 1666 (2021)
---
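
The printed lines mostly follow a "Label: value" pattern, so they can be collected into a dictionary rather than only printed. The following is a minimal sketch using just the standard library. It assumes the line format shown in the output above, which may differ for other companies. `parse_company_lines` and `find_email` are hypothetical helper names, not part of Selenium or BeautifulSoup. A regular expression replaces `split(' ')[1]` so that the email is found regardless of where it sits in the line.

```python
import re

# Loose email pattern; good enough for picking an address out of a short line.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_company_lines(lines):
    """Turn 'Label: value' lines into a dict; lines without any colon are skipped."""
    data = {}
    for line in lines:
        label = None
        # Some lines hold two pairs separated by ' - ' (Partita IVA / Codice Fiscale).
        for piece in line.split(" - "):
            if ":" in piece:
                label, _, value = piece.partition(":")
                label = label.strip()
                data[label] = value.strip()
            elif label is not None:
                # No colon: the piece continues the previous value (e.g. an address).
                data[label] += " - " + piece.strip()
    return data


def find_email(text):
    """Return the first email address found in text, or None."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


# Sample input copied from the output above; with Selenium it would be
# [item.text for item in all_items].
sample = [
    "Partita IVA: 06655971007 - Codice Fiscale: 06655971007",
    "Rag. Sociale: ENEL ENERGIA S.P.A.",
    "Indirizzo: VIALE REGINA MARGHERITA 125 - 00198 - ROMA",
    "Rea: 1150724",
    "PEC: enelenergia@pec.enel.it",
    "Fatturato: € 13.032.695.000,00 (2020)",
    "ACQUISTA BILANCIO",
    "Dipendenti : 1666 (2021)",
]

company = parse_company_lines(sample)
print(company["Rag. Sociale"])
print(find_email(company["PEC"]))
```

Lines such as "ACQUISTA BILANCIO" carry no label and are dropped, while the address keeps its internal " - " separators because those pieces contain no colon.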