Scrapy - Scrape multiple URLs using results from the first URL
- I use Scrapy to scrape data from a first URL.
- The first URL returns a response containing a list of URLs.
So far, so good. My question is: how do I go on to scrape that list of URLs? After searching, I know I can return a Request from parse, but it seems that only handles one URL.
Here is my parse method:
def parse(self, response):
    # Get the list of URLs, for example:
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(urls[0])
    # It works, but how can I continue with b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    for link in urls:
        scrapy.Request(link)
    # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        urls = ["http://a.com", "http://b.com", "http://c.com"]
        for link in urls:
            scrapy.Request(link)
        # This is wrong, though I need something like this
For that, you need to subclass scrapy.Spider and define a list of start URLs. Scrapy will then crawl each of those URLs and pass every response to your parse method.
Like this:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information in Scrapy's official documentation.
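If you need more control over how the crawl starts, you can also override start_requests() instead of relying on start_urls; it lets you attach an explicit callback to each seed request. A minimal sketch, reusing the same example domains (parse_page is just an illustrative callback name, not a Scrapy built-in):

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def start_requests(self):
        # Yield one Request per seed URL; Scrapy passes each
        # response to the callback given here.
        urls = ["http://a.com/", "http://b.com/", "http://c.com/"]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do whatever you want with each response
        pass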
I think what you are looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    for link in urls:
        request = scrapy.Request(link)
        yield request
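One detail worth knowing: a Request yielded without an explicit callback is handled by self.parse again, which is fine for recursive crawling but not if the linked pages need different parsing. A small sketch with an explicit callback (parse_detail is a hypothetical name for illustration):

def parse(self, response):
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    for link in urls:
        # Scrapy schedules each yielded Request and sends the
        # resulting response to parse_detail instead of parse.
        yield scrapy.Request(link, callback=self.parse_detail)

def parse_detail(self, response):
    # Extract whatever you need from each followed page
    pass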
# Within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)
This should work.
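If the extracted href values can be relative, response.follow (available in Scrapy 1.4+) resolves them against the current page's URL, so you don't have to build absolute URLs yourself. A minimal sketch of the same loop using it:

def parse(self, response):
    # response.follow accepts relative URLs and resolves them
    # against response.url before creating the Request.
    for href in response.xpath('//a/@href').extract():
        yield response.follow(href, callback=self.parse)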