Convert relative URL in Scrapy Crawler Rule to absolute URL
I'm trying to build a crawler with this rule so that it clicks through to each property's page and scrapes the details. However, the URL is relative, and it can't be used in the Scrapy Crawler Rule because the rule only accepts absolute URLs. This is the workaround I came up with using process_value, but it doesn't work. Can anyone suggest another way to solve this? Thanks!
Here is the current code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EdgepropSpider(CrawlSpider):
    name = 'edgeprop'
    allowed_domains = ['edgeprop.my']
    start_urls = ['https://www.edgeprop.my/buy/malaysia/all-residential']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='card tep-listing-card']/a/@href"),
                           process_value=lambda x: 'https://edgeprop.my' + x),
             callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths=("//nav[@aria-label='Listing Page navigation']//li[position() = last()]/a")), follow=True)
    )

    def parse_item(self, response):
        yield {
            'Name': response.xpath("//div[@class='save-share']/following-sibling::h1/text()").get()
        }
Here is the output:
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Spider opened
2021-12-29 10:42:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-29 10:42:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-29 10:42:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.edgeprop.my/buy/malaysia/all-residential> (referer: None)
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-29 10:42:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 4126,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.237148,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 29, 2, 42, 18, 936521),
'httpcompression/response_bytes': 10918,
'httpcompression/response_count': 1,
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 12, 29, 2, 42, 18, 699373)}
2021-12-29 10:42:18 [scrapy.core.engine] INFO: Spider closed (finished)
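For reference, Scrapy's LinkExtractor already resolves relative hrefs against the response URL, and restrict_xpaths is meant to select element regions rather than @href attributes, so the rule would normally be written without any process_value at all. A minimal sketch of that variant (same spider, only the rule changed):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


# restrict_xpaths points at the <a> elements (not their @href attribute);
# the link extractor turns the relative hrefs into absolute URLs itself.
class EdgepropSpider(CrawlSpider):
    name = 'edgeprop'
    allowed_domains = ['edgeprop.my']
    start_urls = ['https://www.edgeprop.my/buy/malaysia/all-residential']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='card tep-listing-card']/a"),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Name': response.xpath("//div[@class='save-share']/following-sibling::h1/text()").get()
        }

Even written this way, the crawl is likely to find nothing here: the stats above suggest the listing cards are not in the static HTML, which is why the answer below goes through the site's JSON API instead.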
This is much easier than you might think; you can start from the page's network responses and scrape through JSON requests instead. The payload also provides the next page, along with all the properties on each page. I've built a simple scraper that grabs all the responses; you only need to parse the dictionaries. Note that you may get redirected, so I've added a DOWNLOAD_DELAY, which may help. Everything else is self-explanatory.
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst
from scrapy.item import Field
from scrapy.crawler import CrawlerProcess


class MalItem(scrapy.Item):
    listings = Field(output_processor=TakeFirst())


class MalSpider(scrapy.Spider):
    name = 'Mala'
    # start_urls = []
    start_urls = ['https://www.edgeprop.my/jwdsonic/api/v1/property/search?&listing_type=sale&state=Malaysia&property_type=rl&start=0&size=20']
    # for i in range(0, 5):
    #     start_urls.append(f'https://www.edgeprop.my/jwdsonic/api/v1/property/search?&listing_type=sale&state=Malaysia&property_type=rl&start={i}&size=20')

    custom_settings = {
        # 'LOG_LEVEL': 'CRITICAL',
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'CONCURRENT_REQUESTS': 100,
        'CONCURRENT_REQUESTS_PER_IP': 100,
        'DOWNLOAD_DELAY': 3
    }

    def start_requests(self):
        # Page through the API by incrementing the 'start' parameter.
        for url in self.start_urls:
            for i in range(0, 7709):
                yield scrapy.FormRequest(
                    url,
                    method='GET',
                    formdata={
                        'listing_type': 'sale',
                        'state': 'Malaysia',
                        'property_type': 'rl',
                        'start': str(i),
                        'size': '20'
                    },
                    callback=self.parse
                )

    def parse(self, response):
        # Each response is JSON; the 'property' key holds the list of listings.
        links = response.json().get('property')
        for stuff in links:
            loader = ItemLoader(MalItem())
            loader.add_value('listings', stuff)
            yield loader.load_item()


process = CrawlerProcess(
    settings={
        "FEED_URI": 'stuff.jl',
        "FEED_FORMAT": 'jsonlines'
    }
)
process.crawl(MalSpider)
process.start()
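Once the crawl finishes, every line of stuff.jl is one item whose listings value is the raw property dict returned by the API. A small sketch for inspecting that output before deciding which fields to keep (it assumes the script runs in the same directory as the FEED_URI above; the dict's keys depend on the API and are not assumed here):

import json

# Read the JSON-lines feed produced by the crawl and inspect each listing dict.
with open('stuff.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        listing = item['listings']  # raw property dict from the API response
        print(sorted(listing.keys()))  # see which fields are available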