Spider closes without error messages and does not scrape all the pages in the pagination (SELENIUM)
I've created a pipeline to put all the scraped data into a SQLite database (a minimal sketch of such a pipeline is included after the spider code below), but my spider doesn't finish the pagination. This is what I get when the spider closes. I should be getting around 45k results, but I'm only getting 420. Why might that be?
2021-12-06 14:47:55 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-06 14:47:55 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:60891/session/d441b41f-b62b-4c64-a5ef-68329c18dd4e {}
2021-12-06 14:47:56 [urllib3.connectionpool] DEBUG: http://127.0.0.1:60891 "DELETE /session/d441b41f-b62b-4c64-a5ef-68329c18dd4e HTTP/1.1" 200 14
2021-12-06 14:47:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-06 14:47:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 7510132,
'downloader/response_count': 15,
'downloader/response_status_count/200': 15,
'elapsed_time_seconds': 89.469538,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 6, 20, 47, 55, 551566),
'item_scraped_count': 420,
'log_count/DEBUG': 577,
'log_count/INFO': 11,
'request_depth_max': 14,
'response_received_count': 15,
'scheduler/dequeued': 15,
'scheduler/dequeued/memory': 15,
'scheduler/enqueued': 15,
'scheduler/enqueued/memory': 15,
'start_time': datetime.datetime(2021, 12, 6, 20, 46, 26, 82028)}
2021-12-06 14:47:56 [scrapy.core.engine] INFO: Spider closed (finished)
Here is my spider:
import scrapy
from scrapy_selenium import SeleniumRequest

class HomesSpider(scrapy.Spider):
    name = 'homes'

    def remove_characters(self, value):
        return value.strip(' m²')

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/v1c1097l1021p1',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        homes = response.xpath("//div[@id='tileRedesign']/div")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//span[@class='ad-price']/text())").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                'bathrooms': home.xpath("//div[@class='chiplets-inline-block re-bathroom']/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': self.remove_characters(home.xpath("normalize-space(.//div[@class='chiplets-inline-block surface-area']/text())").get()),
                'link': home.xpath("//div[@class='tile-desc one-liner']/a/@href").get()
            }

        next_page = response.xpath("//a[@class='icon-pagination-right']/@href").get()
        if next_page:
            absolute_url = f"https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/v1c1097l1021p1{next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse,
                dont_filter=True
            )
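For reference, a minimal sketch of the kind of SQLite pipeline mentioned above might look like this (the class, table, and file names here are hypothetical, not the asker's actual code):

import sqlite3

class SQLitePipeline:
    # hypothetical pipeline: stores each scraped item in a local SQLite table
    def open_spider(self, spider):
        self.connection = sqlite3.connect('homes.db')  # placeholder filename
        self.cursor = self.connection.cursor()
        self.cursor.execute(
            'CREATE TABLE IF NOT EXISTS homes '
            '(price TEXT, location TEXT, description TEXT, '
            'bathrooms TEXT, bedrooms TEXT, m2 TEXT, link TEXT)'
        )

    def process_item(self, item, spider):
        self.cursor.execute(
            'INSERT INTO homes VALUES (?, ?, ?, ?, ?, ?, ?)',
            (item['price'], item['location'], item['description'],
             item['bathrooms'], item['bedrooms'], item['m2'], item['link'])
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.connection.close()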
Is this related to my user_agent, which I've already assigned in settings.py anyway, or am I being banned from this page? The HTML of the page hasn't changed at all either.
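For reference, assigning a user agent in settings.py usually looks something like this (a sketch; the UA string and driver path are placeholders, not the asker's actual settings):

# settings.py -- sketch with placeholder values
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36')

# scrapy-selenium additionally needs its driver settings:
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'  # placeholder path
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

Note that requests made through Selenium are sent by the browser itself, so the site sees the browser's own user agent, not Scrapy's USER_AGENT setting.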
Thanks.
Your code runs fine and does what you intended; the problem is in the pagination portion. I build the pagination into the start URLs instead; this type of pagination is always accurate and more than twice as fast as following the next-page link (a fix for the next-page approach itself is sketched after the output below). There are 50 pages in total, and the total number of scraped items is 1400.
Script
import scrapy
from scrapy_selenium import SeleniumRequest

class HomesSpider(scrapy.Spider):
    name = 'homes'

    def remove_characters(self, value):
        return value.strip(' m²')

    def start_requests(self):
        # build every results-page URL up front instead of following the next-page link
        urls = [f'https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-{i}/v1c1097l1021p50' for i in range(1, 51)]
        for url in urls:
            yield SeleniumRequest(
                url=url,
                wait_time=5,
                callback=self.parse
            )

    def parse(self, response):
        homes = response.xpath("//div[@id='tileRedesign']/div")
        for home in homes:
            yield {
                'price': home.xpath("normalize-space(.//span[@class='ad-price']/text())").get(),
                'location': home.xpath(".//div[@class='tile-location one-liner']/b/text()").get(),
                'description': home.xpath(".//div[@class='tile-desc one-liner']/a/text()").get(),
                # note: 'bathrooms' and 'link' use absolute // paths, so they match the first
                # occurrence in the whole page -- hence the repeated values in the output below
                'bathrooms': home.xpath("//div[@class='chiplets-inline-block re-bathroom']/text()").get(),
                'bedrooms': home.xpath(".//div[@class='chiplets-inline-block re-bedroom']/text()").get(),
                'm2': self.remove_characters(home.xpath("normalize-space(.//div[@class='chiplets-inline-block surface-area']/text())").get()),
                'link': home.xpath("//div[@class='tile-desc one-liner']/a/@href").get()
            }
Output
{'price': ',520,664', 'location': 'Santiago de Querétaro', 'description': 'Paso de los Toros Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '151', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': ',690,000', 'location': 'El Refugio', 'description': 'Riaño Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '224', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-rincones-marques/5d6951eee4b05e9aaae12de6'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': ',690,000', 'location': 'El Refugio', 'description': 'Riaño Residencial el Refugio', 'bathrooms': '2', 'bedrooms': '3', 'm2': '224', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.vivanuncios.com.mx/s-venta-inmuebles/queretaro/page-50/v1c1097l1021p50>
{'price': '', 'location': None, 'description': None, 'bathrooms': '2', 'bedrooms': None, 'm2': '', 'link': '/d-desarrollo-huizache-boutique-homes/613b978bcb0ee8503b6f9f22'}
2021-12-07 06:06:33 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-07 06:06:33 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:65206/session/1487a9ea1c9752794aad497613552337 {}
2021-12-07 06:06:33 [urllib3.connectionpool] DEBUG: http://127.0.0.1:65206 "DELETE /session/1487a9ea1c9752794aad497613552337 HTTP/1.1" 200 14
2021-12-07 06:06:33 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-12-07 06:06:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 23589849,
'downloader/response_count': 50,
'downloader/response_status_count/200': 50,
'elapsed_time_seconds': 150.933428,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 7, 0, 6, 33, 111357),
'item_scraped_count': 1400,
...and so on
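As an aside, if the next-page approach from the question is preferred, the usual fix is to build the absolute URL with response.urljoin() instead of concatenating the href onto the page-1 URL (a sketch, untested against this site):

# sketch: replacing the hard-coded concatenation at the end of parse()
next_page = response.xpath("//a[@class='icon-pagination-right']/@href").get()
if next_page:
    yield SeleniumRequest(
        url=response.urljoin(next_page),  # resolves relative hrefs against the current page
        wait_time=3,
        callback=self.parse,
        dont_filter=True
    )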