Scrapy only scraping and crawling HTML and TXT
For learning purposes, I have been trying to recursively crawl and scrape all URLs under https://triniate.com/images/, but it seems Scrapy only wants to crawl and scrape TXT, HTML, and PHP URLs.
Here is my spider code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem


class HelloSpider(CrawlSpider):
    # Identifier used when running scrapy from the CLI
    name = 'hello'
    # Domains the spider is allowed to explore
    allowed_domains = ["triniate.com"]
    # Starting URL for the crawl
    start_urls = ["https://triniate.com/images/"]
    # A LinkExtractor argument can restrict the rule (e.g. scrape only pages whose URL
    # contains "new"), but here it takes no arguments because all pages are targeted.
    # When a page matching the Rule is downloaded, the function named in callback is called.
    # With follow=True, the crawl continues recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape
        # (besides XPath, CSS selectors can also be used)
        item['title'] = "idc"
        return item
items.py contains:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.item import Item, Field


class PageInfoItem(Item):
    URL = Field()
    title = Field()
The console output was:
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-21 22:30:50 [scrapy.extensions.feedexport] INFO: Stored json feed (175 items) in: haxx.json
2022-04-21 22:30:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 59541,
'downloader/request_count': 176,
'downloader/request_method_count/GET': 176,
'downloader/response_bytes': 227394,
'downloader/response_count': 176,
'downloader/response_status_count/200': 176,
'dupefilter/filtered': 875,
'elapsed_time_seconds': 8.711563,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 22, 3, 30, 50, 142416),
'httpcompression/response_bytes': 402654,
'httpcompression/response_count': 175,
'item_scraped_count': 175,
'log_count/DEBUG': 357,
'log_count/INFO': 11,
'request_depth_max': 5,
'response_received_count': 176,
'scheduler/dequeued': 176,
'scheduler/dequeued/memory': 176,
'scheduler/enqueued': 176,
'scheduler/enqueued/memory': 176,
'start_time': datetime.datetime(2022, 4, 22, 3, 30, 41, 430853)}
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Spider closed (finished)
Could someone suggest how I should change my code to get the result I want?
EDIT: To clarify, I am trying to obtain the URLs, not the images or files themselves.
To do this, you need to know how Scrapy works. First, write a spider that recursively crawls all directories starting from the root URL and extracts every image link on each page it visits.
So I wrote this code for you and tested it on the site you provided. It works perfectly.
import scrapy


class ImagesSpider(scrapy.Spider):
    name = "images"
    image_ext = ['png', 'gif']
    images_urls = set()
    start_urls = [
        'https://triniate.com/images/',
        # if there are some other urls you want to scrape the same way
        # add them in this list
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.get_images)

    def get_images(self, response):
        all_hrefs = response.css('a::attr(href)').getall()
        all_images_links = list(filter(lambda x: x.split('.')[-1] in self.image_ext, all_hrefs))
        for link in all_images_links:
            self.images_urls.add(link)
            yield {'link': f'{response.request.url}{link}'}
        next_page_links = list(filter(lambda x: x[-1] == '/', all_hrefs))
        for link in next_page_links:
            yield response.follow(link, callback=self.get_images)
This way you get the links to every image available on this page and in any inner directory (recursively). The get_images method looks for images on each page: it collects all the image links and also queues any directory links for crawling, so it ends up gathering the image links from every directory.
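As a small standalone sketch of the two filters at work (the sample hrefs are illustrative, and urljoin is shown here as an alternative to the string concatenation above that also copes with absolute hrefs):

from urllib.parse import urljoin

image_ext = ['png', 'gif']
hrefs = ['ChatIcon.png', 'Sprite1.gif', 'objects/', 'index.html']  # illustrative sample

# Image links are picked by file extension, directory links by a trailing slash.
image_links = [h for h in hrefs if h.split('.')[-1] in image_ext]
dir_links = [h for h in hrefs if h.endswith('/')]

base = 'https://triniate.com/images/'
print([urljoin(base, h) for h in image_links])   # full image URLs under the listing
print([urljoin(base, h) for h in dir_links])     # sub-directories to crawl next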
The code I provided produces results containing all the links you want:
[
{"link": "https://triniate.com/images/ChatIcon.png"},
{"link": "https://triniate.com/images/Sprite1.gif"},
{"link": "https://triniate.com/images/a.png"},
...
...
...
{"link": "https://triniate.com/images/objects/house_objects/workbench.png"}
]
Note: I specified the image file extensions in the image_ext attribute. You can extend it to every available image extension, or include only the ones that exist on the site, as I did. Your choice.
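If you would rather not hard-code extensions, one possible variation (my assumption, not part of the code above) is to test the guessed MIME type with Python's standard mimetypes module:

import mimetypes

def is_image_link(href):
    # mimetypes guesses e.g. 'image/gif' for 'Sprite1.gif' and 'text/html' for 'index.html'
    guessed, _ = mimetypes.guess_type(href)
    return guessed is not None and guessed.startswith('image/')

print(is_image_link('Sprite1.gif'))   # True
print(is_image_link('index.html'))    # False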
I tried with a basic spider and scrapy-selenium, and it works.
basic.py
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['triniate.com']

    def start_requests(self):
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        driver.set_window_size(1920, 1080)
        driver.get("https://triniate.com/images/")
        links = driver.find_elements(By.XPATH, "//html/body/table/tbody/tr/td[2]/a")
        for link in links:
            href = link.get_attribute('href')
            yield SeleniumRequest(
                url=href
            )
        driver.quit()
        return super().start_requests()

    def parse(self, response):
        yield {
            'URL': response.url
        }
Added to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
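For completeness, scrapy_selenium.SeleniumMiddleware is normally configured with a few extra settings as well (per the scrapy-selenium README); the browser choice and the headless flag below are assumptions for this site:

# settings.py (sketch; chrome/chromedriver and the headless flag are assumptions)
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'                           # browser the middleware drives
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')   # path to the matching driver binary
SELENIUM_DRIVER_ARGUMENTS = ['--headless']                # run without opening a window

With that in place, the items can be exported as before, e.g. with scrapy crawl basic -O output.json.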
Output:
2022-04-22 12:03:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/stand_right.gif>
{'URL': 'https://triniate.com/images/stand_right.gif'}
2022-04-22 12:03:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://triniate.com/images/walk_right_transparent.gif> (referer: None)
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_back.gif>
{'URL': 'https://triniate.com/images/walk_back.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_left_transparent.gif>
{'URL': 'https://triniate.com/images/walk_left_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_front_transparent.gif>
{'URL': 'https://triniate.com/images/walk_front_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_back_transparent.gif>
{'URL': 'https://triniate.com/images/walk_back_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_right.gif>
{'URL': 'https://triniate.com/images/walk_right.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_right_transparent.gif>
{'URL': 'https://triniate.com/images/walk_right_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.engine] INFO: Closing spider (finished)