Download all related .PDF files for a specific topic with depth
I am quite new to Python and Scrapy. My task is to download the .PDF files for a specific topic. For example, there are many contracts on this website **https://www.sec.gov/**, and at the moment I am downloading the files one by one. I need to write a Scrapy program that downloads all the related .PDF files for a search keyword, for example **keyword: Exhibit 10 / EXHIBIT 11**.
## My Code ##
```python
#import urllib
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://www.sec.gov/cubist-pharmaceuticals-inc-exhibit-10-65-10-k405"]

    def parse(self, response):
        base_url = 'https://www.sec.gov/'
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            # self.logger.info(link)
            if link.endswith('.pdf'):
                #link = urllib.parse.urljoin(base_url, link)
                link = base_url + link
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
With this Scrapy code I can only download the PDFs on the given URL.
For example: https://www.sec.gov/cubist-pharmaceuticals-inc-exhibit-10-65-10-k405
(If I give the URL above, the file does get downloaded, but I could just as well do that manually; what I need is to download every PDF that matches the search term.)
If I search with the keyword Exhibit 10, the following page comes up: https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10, and I want Scrapy to open all of those result links and download every PDF. Could someone help me fix this code? Thanks in advance.
You should start with the search query URL in start_urls, extract all the result URLs from the start_url response, and send a request to each of them. After that, extract the PDF link and save it to local storage.
The code would look something like this:
```python
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%E2%9C%93&affiliate=secsearch&sort_by=&query=Exhibit+10%2F+EXHIBIT+11"]

    def parse(self, response):
        # extract the search result links
        for link in response.xpath('//div[@id="results"]//h4[@class="title"]/a/@href').extract():
            req = Request(url=link, callback=self.parse_page)
            yield req

    def parse_page(self, response):
        # parse each search result here
        pdf_files = response.xpath('//div[@class="article-file-download"]/a/@href').extract()
        # the base url won't be part of these pdf_files
        # sample: [u'/files/18-03273-E.pdf']
        # it needs to be added at the beginning of each url;
        # response.urljoin() will do that for you
        for pdf in pdf_files:
            if pdf.endswith('.pdf'):
                pdf_url = response.urljoin(pdf)
                req = Request(url=pdf_url, callback=self.save_pdf)
                yield req

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
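If it helps, here is a minimal sketch of running the spider above without a full Scrapy project, using Scrapy's `CrawlerProcess`. It assumes the `pwc_tax` class is defined in the same file; the `USER_AGENT` value is only a placeholder, since sec.gov tends to reject automated requests that do not identify themselves.

```python
from scrapy.crawler import CrawlerProcess

# Minimal runner for the spider above (pwc_tax must be defined or imported here).
# The USER_AGENT string is an example value; replace it with something that
# identifies you and includes a contact address.
process = CrawlerProcess(settings={
    "USER_AGENT": "pdf-downloader (your-email@example.com)",
})
process.crawl(pwc_tax)
process.start()  # blocks until the crawl is finished
```

Alternatively, `scrapy runspider your_spider_file.py` does the same thing from the command line. Either way, the PDFs end up in the directory the spider is run from, because `save_pdf` opens the file with a bare relative path.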