Spider returns only "[" in the items.json file
I wrote a spider to extract images from a website, but the items.json file contains only the [ character.
Please help me.
My spider file looks like this:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from captcha.items import CaptchaItem

class CaptchaSpider(CrawlSpider):
    name = "CaptchaSpider"
    allowed_domains = ["*****.ac.in"]
    start_urls = [
        "https://*****.ac.in/*****.asp"
    ]

def parse_item(self, response):
    item = CaptchaItem()
    hxs = HtmlXPathSelector(response)
    item['im'] = hxs.select('//img/@src').extract()
    return item
My items.py file looks like this:
import scrapy

class CaptchaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    im = scrapy.Field()
    pass
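The output file will contain only [ if there were errors while crawling or if no items were returned.
In your case the cause is indentation: parse_item() is defined at module level, so it is not part of the spider. It should be indented so that it becomes a method of the class: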
class CaptchaSpider(CrawlSpider):
    name = "CaptchaSpider"
    allowed_domains = ["*****.ac.in"]
    start_urls = [
        "https://*****.ac.in/*****.asp"
    ]

    # Indented one level so that it is a method of CaptchaSpider
    def parse_item(self, response):
        item = CaptchaItem()
        hxs = HtmlXPathSelector(response)
        item['im'] = hxs.select('//img/@src').extract()
        return item
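
To see why the indentation matters, here is a minimal sketch in plain Python with no Scrapy involved. BrokenSpider and FixedSpider are made-up names for illustration only; the point is just that a dedented def becomes a module-level function rather than a method of the class above it.

```python
class BrokenSpider:
    name = "broken"


# Dedented: this is a module-level function, NOT a method of BrokenSpider,
# so nothing that looks up callbacks on the spider will ever find it.
def parse_item(self, response):
    return {"im": response}


class FixedSpider:
    name = "fixed"

    # Indented: this is a method of FixedSpider.
    def parse_item(self, response):
        return {"im": response}


print(hasattr(BrokenSpider, "parse_item"))  # False
print(hasattr(FixedSpider, "parse_item"))   # True
```

As a side note, a CrawlSpider calls a callback such as parse_item only when something refers to it, typically a Rule with callback='parse_item', so it is worth checking that as well if the file is still empty after fixing the indentation.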
I have actually tested this and reproduced it:
$ scrapy runspider spider.py -o items.json
...
$ cat items.json
[
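
If you want to check for this state from a script rather than by eye, something along these lines might help. It is only a rough sketch using the standard library; describe_feed is a hypothetical helper name, not part of Scrapy, and it assumes the feed was exported as a JSON array.

```python
import json
from pathlib import Path


def describe_feed(path):
    """Return a short description of the state of a JSON feed file."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return "empty file"
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        # A lone "[" ends up here: the array was opened but never closed.
        return "incomplete JSON"
    if not isinstance(items, list):
        return "valid JSON, but not a list"
    return f"{len(items)} item(s)"
```

Calling describe_feed("items.json") on the file shown above would report incomplete JSON, which points to a crawl that produced no items.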