Follow news links with scrapy
I'm new to crawling and scrapy, and I'm trying to extract some news from https://www.lacuarta.com/, but only the news matching the tag san-valentin.
The page only shows the headlines with a picture for each story; if you want to read one you have to click on it, and it takes you to the story's own page (https://www.lacuarta.com/etiqueta/san-valentin/).
So I think my steps should be:
- Go to the page matching the tag I want, in this case san-valentin
- Extract the URLs of the news stories
- Go to each news page
- Extract the data I want
I already have points 1 and 2 working:
import scrapy


class SpiderTags(scrapy.Spider):
    name = "SpiderTags"

    def start_requests(self):
        url = 'https://www.lacuarta.com/etiqueta/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for url in response.css("h4.normal a::attr(href)"):
            yield {
                "link:": url.get()
            }
Up to here I have the links to the news stories. Now I don't know how to go into each story to extract the data I want, then come back to my original page, go to page 2 and repeat everything.
PS: the information I want, I already know how to get:
- Title: response.css("title::text").get()
- Story: response.css("div.col-md-11 p::text").getall()
- Author: response.css("div.col-sm-6 h4 a::text").getall()
- Date: response.css("div.col-sm-6 h4 small span::text").getall()
You need to yield a new Request in order to follow the link. For example:
def parse(self, response):
    for url in response.css("h4.normal a::attr(href)"):
        # This will get the URL value, not follow it:
        # yield {
        #     "link:": url.get()
        # }
        # This will follow the URL:
        yield scrapy.Request(url.get(), self.parse_news_item)

def parse_news_item(self, response):
    # Extract things from the news item page.
    yield {
        'Title': response.css("title::text").get(),
        'Story': response.css("div.col-md-11 p::text").getall(),
        'Author': response.css("div.col-sm-6 h4 a::text").getall(),
        'Date': response.css("div.col-sm-6 h4 small span::text").getall(),
    }
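To also move on to page 2 and repeat, you can yield one more request for the pagination link from the same parse method. A minimal sketch; the CSS selector for the "next page" link is an assumption and may need adjusting to the site's actual markup:

def parse(self, response):
    for url in response.css("h4.normal a::attr(href)"):
        yield scrapy.Request(url.get(), self.parse_news_item)

    # Follow the pagination link, if present (this selector is a guess).
    next_page = response.css("li.active + li a::attr(href)").get()
    if next_page is not None:
        yield response.follow(next_page, self.parse)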
import scrapy
from scrapy.spiders import CrawlSpider


class SpiderName(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['lacuarta.com']
    start_urls = ['https://www.lacuarta.com/etiqueta/san-valentin/']

    def parse(self, response):
        for item in response.xpath('//article[@class="archive-article modulo-fila"]'):
            # maybe you need more data within `item`
            post_url = item.xpath('.//h4/a/@href').extract_first()
            yield response.follow(post_url, callback=self.post_parse)

        next_page = response.xpath('//li[@class="active"]/following-sibling::li/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def post_parse(self, response):
        title = response.xpath('//h1/text()').extract_first()
        story = response.xpath('//div[@id="ambideXtro"]/child::*').extract()
        author = response.xpath('//div[@class="col-sm-6 m-top-10"]/h4/a/text()').extract_first()
        date = response.xpath('//span[@class="ltpicto-calendar"]').extract_first()
        yield {'title': title, 'story': story, 'author': author, 'date': date}
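One caveat: CrawlSpider reserves the parse method for its own rule-processing logic, and this spider defines no rules, so a plain scrapy.Spider does the same job here. A minimal variant of the class declaration, assuming the two callback methods stay exactly as above:

import scrapy


class SpiderName(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['lacuarta.com']
    start_urls = ['https://www.lacuarta.com/etiqueta/san-valentin/']

    # parse() and post_parse() are unchanged from the CrawlSpider version above.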