Scraping and modifying the output
I am trying to retrieve data from this website by scraping:
https://dolar.wilkinsonpc.com.co/dolar-historico/dolar-historico-2018.html
My parser currently looks like this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from w3lib.html import remove_tags


class HDolarSpider(scrapy.Spider):
    name = 'historico-dolar'
    allowed_domains = ['dolar.wilkinsonpc.com.co']
    start_urls = ['https://dolar.wilkinsonpc.com.co/dolar-historico/dolar-historico-2018.html']

    def parse(self, response):
        for sel in response.xpath('//*[@id="tabla_dh"]'):
            date = sel.xpath('/html/body/div[3]/div[5]/div[1]/div/div/div[3]/div/div[5]/div[1]').extract()
            location = sel.xpath('/html/body/div[3]/div[5]/div[1]/div/div/div[3]/div/div[5]/div[2]').extract()
            print(date, location)
The output is:
['<div class="dh_col_fecha">16 Septiembre 2018</div>'] ['<div class="dh_col_precio"><b>$ 3,026.05</b></div>']
I need it like this:
16 Septiembre 2018;3026.05
I tried cleaning it up with w3lib and other tools, but without success. Can anyone help me?
Use/modify this code:
# -*- coding: utf-8 -*-
import scrapy


class HDolarSpider(scrapy.Spider):
    name = 'historico-dolar'
    allowed_domains = ['dolar.wilkinsonpc.com.co']
    start_urls = ['https://dolar.wilkinsonpc.com.co/dolar-historico/dolar-historico-2018.html']

    def parse(self, response):
        # Select every row div inside "tabla_dh" that has a child div
        # whose class contains "dh_col_fecha"
        for subject in response.xpath('//div[@id="tabla_dh"]/div[./div[contains(@class, "dh_col_fecha")]]'):
            yield {
                # Paths start with "./" so they are resolved relative to the current row
                'date': subject.xpath('./div[@class="dh_col_fecha"]/text()').extract_first(),
                'location': subject.xpath('./div[@class="dh_col_precio"]//text()').extract_first(),
            }
If you run this code with:
scrapy runspider HDolarSpider.py -o Report.json
you will get a report in JSON format: a list of objects, each with a "date" and a "location" key, 262 items in total.
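The spider above still yields the price as the raw text "$ 3,026.05", while the question asks for a line like "16 Septiembre 2018;3026.05". Here is a minimal sketch of that last step in plain Python. It assumes every price has the same shape as the one shown in the question (a dollar sign, with a comma as the thousands separator). The helper names clean_price and format_row are made up for illustration and are not part of Scrapy or w3lib.

```python
def clean_price(raw):
    """Turn a scraped price such as '$ 3,026.05' into '3026.05'."""
    # Assumption: '$' is the only currency symbol and ',' is only a
    # thousands separator, as in the sample output from the question.
    return raw.replace('$', '').replace(',', '').strip()


def format_row(date, price, sep=';'):
    """Join a date and a raw price into a single 'date;price' line."""
    return sep.join([date.strip(), clean_price(price)])


if __name__ == '__main__':
    # Sample values copied from the output shown in the question.
    print(format_row('16 Septiembre 2018', '$ 3,026.05'))
```

Inside parse(), the two extract_first() results could be passed through format_row() before yielding. Note that extract_first() returns None when nothing matches, so a row with a missing cell would need a guard first.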