How to use Scrapy and Selenium to get data from a website that uses JavaScript and PHP?
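I'm trying to get data from a website that provides accident information. I'm using Scrapy and Selenium for this, but it isn't working. I'm new to this and trying to understand what is going on. I have both Scrapy and Selenium installed in one venv. The site's structure is somewhat old and hard to follow.
Any help would be appreciated!
I'm using Firefox, so I put this in the settings: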
SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver.exe')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
My code looks like this:
import scrapy
from selenium.webdriver import firefox
from http.server import executable
from lib2to3.pgen2 import driver
from scrapy.utils.project import get_project_settings

class MeldingenSpider(scrapy.Spider):
    name = '112meldingen'

    def start_requests(self):
        settings = get_project_settings
        driver_path = settings.get('SELENIUM_DRIVER_EXECUTABLE_PATH')
        driver = firefox(executable_path=driver_path)
        driver.get('http://ftp.112meldingen.nl/index.php')
        xpath = '//*[@id="divContentAlerts"]'
        link_elements = driver.find_elements_by_xpath(xpath)

    def parse(self, response):
        articles = response.css('table::attr(id.alerts)')
        for article in articles:
            #if "haven" in article.css('div.title a::text').get():
            yield {
                'headline': article.css('td.bold a::text').get() ,
                'timestamp': article.css('td.bold span').get(),
                'location' : article.xpath('td > td').get()[3]
            }
You can scrape the URL with the help of SeleniumRequest.
Script:
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
#from selenium.common.exceptions import NoSuchElementException

class MeldingenSpider(scrapy.Spider):
    name = 'dingen'
    responses = []

    def start_requests(self):
        # Let the Selenium middleware render the page in a real browser
        yield SeleniumRequest(
            url='http://ftp.112meldingen.nl/index.php',
            callback=self.parse
        )

    def parse(self, response):
        # The middleware exposes the WebDriver instance via response.meta
        driver = response.meta['driver']
        initial_page = driver.page_source
        self.responses.append(initial_page)
        driver.implicitly_wait(2)
        for resp in self.responses:
            # Build a Scrapy selector from the rendered HTML
            r = Selector(text=resp)
            articles = r.css('table#alerts')
            for article in articles:
                #if "haven" in article.css('div.title a::text').get():
                yield {
                    'headline': article.css('td.bold a::text').get() ,
                    'timestamp': article.css('td.bold span::text').get().replace('\xa0\xa021',''),
                    'location' : [x.replace('\xa0',' ') for x in article.xpath('.//tr/td[@class="bold center"]/following-sibling::td//text()').getall()][-1]
                }
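The `.replace('\xa0\xa021', '')` call in the script appears to strip two non-breaking spaces together with the literal day "21", so it would presumably stop matching on any other day of the month, and it raises `AttributeError` when `.get()` returns `None`. A minimal sketch of a more tolerant alternative follows. It assumes the raw text looks like `'18:42:39\xa0\xa021-03-22'` (time, non-breaking spaces, then DD-MM-YY), which is inferred from the output rather than confirmed against the site; `parse_timestamp` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(raw: Optional[str]) -> Optional[datetime]:
    """Parse the timestamp cell text into a datetime, or return None.

    Assumed input format (not confirmed by the site): HH:MM:SS,
    whitespace or non-breaking spaces, then DD-MM-YY.
    """
    if raw is None:
        # .get() returns None when the selector matches nothing
        return None
    # str.split() without arguments splits on any Unicode whitespace,
    # which includes the non-breaking space U+00A0
    parts = raw.split()
    if len(parts) != 2:
        return None
    try:
        return datetime.strptime(" ".join(parts), "%H:%M:%S %d-%m-%y")
    except ValueError:
        # Text did not match the assumed format
        return None


print(parse_timestamp('18:42:39\xa0\xa021-03-22'))  # 2022-03-21 18:42:39
print(parse_timestamp(None))                        # None
```

In the spider this could replace the chained `.get().replace(...)`, for example `'timestamp': parse_timestamp(article.css('td.bold span::text').get())`.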
You have to change the following settings in the settings.py file:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Middleware
DOWNLOADER_MIDDLEWARES = {
'scrapy_selenium.SeleniumMiddleware': 800
}
# Selenium
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
# '--headless' if using chrome instead of firefox
SELENIUM_DRIVER_ARGUMENTS = ['--headless']
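Since the question uses Firefox, a sketch of the equivalent Firefox settings is shown below. It mirrors the Chrome block and assumes geckodriver is on the PATH; the middleware entry stays the same.

```python
# settings.py (Firefox variant; assumes geckodriver is installed and on PATH)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
# which() returns None if the executable cannot be found
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
# Firefox takes a single-dash '-headless' argument
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
```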
Output:
{'headline': 'B2 ', 'timestamp': '18:42:39-03-22', 'location': '.'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A1 BAARLE-NASSAU RIT: 37170 (DIRECTE INZET: JA)', 'timestamp': '18:42:22-03-22', 'location': '1220499 Monitor Regionale Ambulancevoorziening Midden- en West-Brabant'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 CALLENBURGHPLANTSOEN VOORSC DIRECTE INZET 16185', 'timestamp': '18:42:15-03-22', 'location': '1523185 Ambulance-16-185 Hollands Midden'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 DP5 WESTLAND GALGEPAD NAALDW ', 'timestamp': '18:42:00-03-22', 'location': '.'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 ', 'timestamp': '18:41:31-03-22', 'location': '0120999 Monitor Regionale Ambulancevoorziening Amsterdam-Amstelland'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 ', 'timestamp': '18:41:09-03-22', 'location': '1123128 Ambulance-22-128 Helmond'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A1 AMBU 17104 VERBOOMSTRAAT 3082JC ROTTERDAM ROTTDM BON 40895', 'timestamp': '18:40:43-03-22', 'location': '1420999 Monitor Regionale Ambulancevoorziening Rotterdam-Rijnmond'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 BEST RIT: 31448', 'timestamp': '18:40:20-03-22', 'location': '1123124 Ambulance-22-124 Helmond'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 MIDDELBURG RIT: 37168 (DIRECTE INZET: JA)', 'timestamp': '18:40:11-03-22', 'location': '1320101 Ambulance-19-101 Goes'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': None, 'timestamp': '18:39:49-03-22', 'location': '1420999 Monitor Regionale Ambulancevoorziening Rotterdam-Rijnmond'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'ONGEVAL MATERIEEL NIEUWE GRACHT HAARLEM', 'timestamp': '18:39:49-03-22', 'location': '0127850 Persinformatie Politie Kennemerland'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': None, 'timestamp': '18:39:48-03-22', 'location': '1420023 Ambulance-17-123 Rotterdam-Rijnmond'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 BAARLE-NASSAU RIT: 37167 (DIRECTE INZET: JA)', 'timestamp': '18:39:37-03-22', 'location': '1220646 Ambulance-20-146 Tilburg-Noord'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A1 EINDHOVEN RIT: 31447', 'timestamp': '18:39:29-03-22', 'location': '1123101 Ambulance-22-101 Eindhoven'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'B1 NIEUWEGEIN 29751', 'timestamp': '18:39:26-03-22', 'location': '0726137 Ambulance-09-137 Amersfoort'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'B1 AMBU 06152 - AALTEN RIT 27711', 'timestamp': '18:38:22-03-22', 'location': '0820152 Ambulance-06-128 Noord- en Oost-Gelderland'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': None, 'timestamp': '18:38:01-03-22', 'location': '0108999 Monitor Brandweer Veiligheidsregio Kennemerland'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': None, 'timestamp': '18:37:12-03-22', 'location': '0520000 Regionaal Proefalarm GHOR Drenthe'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 DP1 HARNASCHPOLDER VRIJ-HARNASCH DENHZH ', 'timestamp': '18:37:12-03-22', 'location': '.'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 11134 RIT 38636 ', 'timestamp': '18:36:30-03-22', 'location': '0126999 Monitor Regionale Ambulancevoorziening Kennemerland'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A1 ZUIDERBEEKWEG 6862EM OOSTERBEEK 41964', 'timestamp': '18:36:28-03-22', 'location': '0920118 Ambulance-07-118 Barneveld'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 SOEST 29750', 'timestamp': '18:36:18-03-22', 'location': '0726104 Ambulance-09-148 Utrecht'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 RAAMSDONKSVEER RIT: 37166', 'timestamp': '18:36:14-03-22', 'location': '1220626 Ambulance-20-126 Breda-Zuid'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A1 11125 RIT 38634 ELINE VEREPLANTSOEN ZAANDAM', 'timestamp': '18:35:20-03-22', 'location': '0126999 Monitor Regionale Ambulancevoorziening Kennemerland'}
2022-03-21 23:43:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://ftp.112meldingen.nl/index.php>
{'headline': 'A2 ', 'timestamp': '18:35:07-03-22', 'location': '1423397 Ambulance-18-197 Papendrecht'}