Scrapy: how to change a value in an input form, submit it, and then scrape the page
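I want to enter a value into a text input field, submit the form, and then scrape the new data that appears on the page after the form has been submitted.
How can this be done?
This is the HTML form on the page. I want to change the input value from 10 to 100 and submit the form.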
<form action="https://de.iss.fst.com/ba-u6-72-nbr-902-112-x-140-x-13-12-mm-simmerringr-ba-a-mit-feder-fst-40411416#product-offers-anchor" method="post" _lpchecked="1">
  <div class="fieldset">
    <div class="field qty">
      <div class="control">
        <label class="label" for="qty-2">
          <span>Preise für</span>
        </label>
        <input type="text" name="pieces" class="validate-length maximum-length-10 qty" maxlength="12" id="qty-2" value="10">
        <label class="label" for="qty-2">
          <span>Teile</span>
        </label>
        <span class="actions">
          <button type="submit" title="Absenden" class="action">
            <span>Absenden</span>
          </button>
        </span>
      </div>
    </div>
  </div>
</form>
Update!
Here is the new, working code.
import scrapy
import pymongo
from scrapy_splash import SplashRequest, SplashFormRequest
from issfst.items import IssfstItem


class IssSpider(scrapy.Spider):
    name = "issfst_spider"
    start_urls = ["https://de.iss.fst.com/dichtungen/radialwellendichtringe/rwdr-mit-geschlossenem-kafig/ba"]
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["imgurl",
                               "Produktdatenblatt",
                               "Materialdatenblatt"],
    }

    def parse(self, response):
        self.log("I just visited: " + response.url)
        urls = response.css('.details-button > a::attr(href)').extract()
        for url in urls:
            # submit the quantity form on each product page
            formdata = {'pieces': '200'}
            yield SplashFormRequest.from_response(
                response,
                url=url,
                formdata=formdata,
                callback=self.parse_details,
                args={'wait': 3}
            )
        # follow pagination link
        next_page_url = response.css('li.item > a.next::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        item = IssfstItem()
        # scrape image url
        # (no trailing commas here: a trailing comma would store a one-element tuple)
        item['imgurl'] = response.css('img.fotorama__img::attr(src)').extract()
        # scrape download pdf links
        item['Produktdatenblatt'] = response.css('a.action[data-group="productdatasheet"]::attr(href)').extract_first()
        item['Materialdatenblatt'] = response.css('a.action[data-group="materialdatasheet"]::attr(href)').extract_first()
        item['Beschreibung'] = response.css('.description > p::text').extract_first()
        yield item
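You should not rely on the HTML source to find out the parameter names of the POST request. Instead, use the developer tools of your favourite browser and watch the Network tab with log preservation enabled.
Doing so shows that the request uses the parameters pieces and form_key.
Look for the POST request to the URL https://de.iss.fst.com/ba-72-nbr-902-155-x-174-x-12-0-mm-simmerringr-ba-a-mit-feder-fst-40411424#product-offers-anchor.
Your form data did not work because you set it under the wrong name 'value', while the site expects the name 'pieces'.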
Now, as a demonstration in a scrapy shell session:
scrapy shell "https://de.iss.fst.com/ba-72-nbr-902-155-x-174-x-12-0-mm-simmerringr-ba-a-mit-feder-fst-40411424"
...
from scrapy import FormRequest

## SET THE POST PARAMETERS
form_key = response.css('[name="form_key"]::attr(value)').get()
# Note: response.xpath('input[@name="form_key"]/@value') returns nothing,
# because that XPath only looks at direct children of the root node;
# with a leading // ('//input[@name="form_key"]/@value') it should match too.
pieces = "100"
form_data = {'form_key': form_key, 'pieces': pieces}  # with the correct names

## POST THE REQUEST
fetch(
    FormRequest(
        'https://de.iss.fst.com/ba-72-nbr-902-155-x-174-x-12-0-mm-simmerringr-ba-a-mit-feder-fst-40411424#product-offers-anchor',
        formdata=form_data)
)  # note the '#product-offers-anchor' appended to the URL; otherwise it won't work
view(response)  # to see the page in your default browser
You can now adapt the above to your own code.