ValueError: Missing scheme in request url: h
I am a beginner with Scrapy and Python. I tried to deploy my spider code on Scrapinghub and ran into the following error. Here is the code.
import scrapy
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import re
import pkgutil
from pkg_resources import resource_string
from tues1402.items import Tues1402Item

data = pkgutil.get_data("tues1402", "resources/urllist.txt")

class SpiderTuesday(scrapy.Spider):
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]

    def parse(self, response):
        story = Tues1402Item()
        story['url'] = response.url
        story['title'] = response.xpath("//title/text()").extract()
        return story
This is my spider.py,
import scrapy

class Tues1402Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
this is items.py, and
from setuptools import setup, find_packages

setup(
    name = 'tues1402',
    version = '1.0',
    packages = find_packages(),
    entry_points = {'scrapy': ['settings = tues1402.settings']},
    package_data = {'tues1402': ['resources/urllist.txt']},
    zip_safe = False,
)
this is setup.py.

The error is as follows:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 126, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
Thanks in advance.
Your error means that the url h is not a valid url. You should print out your self.start_urls and check what urls are in there; you most likely have the string h as your first url.
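To illustrate (with hypothetical file contents, not the asker's actual urllist.txt): iterating over a string visits it character by character, which is exactly how the first "url" ends up being h:

```python
# Hypothetical contents of urllist.txt, read in as one string
data = "http://example.com\nhttp://example.org\n"

# Iterating a string yields single characters, not lines
start_urls = [url.strip() for url in data]
print(start_urls[:4])  # ['h', 't', 't', 'p']
```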
It seems your spider iterates over the file's text rather than over a list of urls here:

data = pkgutil.get_data("tues1402", "resources/urllist.txt")

class SpiderTuesday(scrapy.Spider):
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]
Assuming you store the urls with some separator in the urllist.txt file, you should split the data accordingly:

# assuming file has url in every line
self.start_urls = [url.strip() for url in data.splitlines()]
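A minimal runnable sketch of that fix, again with hypothetical file contents. Note that pkgutil.get_data returns bytes, so on Python 3 you would also decode before splitting (the question uses Python 2, where this is not strictly required):

```python
# Hypothetical bytes as returned by pkgutil.get_data
data = b"http://example.com\nhttp://example.org\n"

# Decode to text, then split on line boundaries to get one url per entry
start_urls = [url.strip() for url in data.decode("utf-8").splitlines()]
print(start_urls)  # ['http://example.com', 'http://example.org']
```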