URLOpen Error while combining url with word from wordlist
Hi everyone, I'm writing a Python web crawler. I have a link whose last characters are "search?q=", and I append a word from a wordlist that I loaded into a list beforehand. But when I try to open it with urllib2.urlopen(url) it throws an error (urlopen error no host given). Yet when I open the link manually (typing in the word that would normally be appended automatically), it works fine. Can you tell me why this happens?

Thanks and regards

Full error:
Traceback (most recent call last):
  File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 61, in <module>
    getResults()
  File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 40, in getResults
    usock = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 402, in open
    req = meth(req)
  File "C:\Python27\lib\urllib2.py", line 1113, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>
Code:
with open(filePath, "r") as ins:
    wordList = []
    for line in ins:
        wordList.append(line)

def getResults():
    packageID = ""
    count = 0
    word = "Test"
    for x in wordList:
        word = x
        print word
        url = 'http://www.example.com/search?q=' + word
        usock = urllib2.urlopen(url)
        page_source = usock.read()
        usock.close()
        print page_source
        startSequence = "data-docid=\""
        endSequence = "\""
        while page_source.find(startSequence) != -1:
            start = page_source.find(startSequence) + len(startSequence)
            end = page_source.find(endSequence, start)
            print str(start)
            print str(end)
            link = page_source[start:end]
            print link
            if link:
                if not link in packageID:
                    packageID += link + "\r\n"
                    print packageID
            page_source = page_source[end + len(endSequence):]
            count += 1
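As an aside, the `find()`-based scanning loop above can be written more compactly with a regular expression. A sketch (the sample `page_source` string here is invented for illustration):

```python
import re

# Invented sample input standing in for the real page source
page_source = 'a data-docid="com.example.app" b data-docid="com.other.app" c'

# Non-greedy capture of everything between data-docid=" and the next quote,
# equivalent to the startSequence/endSequence scan above
ids = re.findall(r'data-docid="(.*?)"', page_source)
print(ids)  # ['com.example.app', 'com.other.app']
```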
So when I print the word, it outputs the correct word from the wordlist.
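A likely cause (an assumption; the traceback alone does not prove it): `for line in ins` keeps the trailing newline on every word, and a newline embedded in the URL prevents urllib2 from parsing a host out of it, while `print word` still looks correct because the newline is invisible. A minimal sketch of the fix, shown with Python 3's `urllib.parse` for the encoding step:

```python
from urllib.parse import quote_plus

# Lines read from a file keep their trailing newline
word = "flashlight\n"
url = 'http://www.example.com/search?q=' + word
print(repr(url))  # repr() makes the hidden \n visible

# Strip surrounding whitespace and URL-encode before building the request
fixed_url = 'http://www.example.com/search?q=' + quote_plus(word.strip())
print(fixed_url)  # http://www.example.com/search?q=flashlight
```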
Note that urlopen() returns a response, not a request.
Your proxy configuration may be broken; verify that your proxies are working:

print urllib.getproxies()
or bypass proxy support entirely (Python 2's urllib.urlopen() accepts a proxies argument; an empty dict disables them):

response = urllib.urlopen(
    "http://www.example.com/search?q=" + text_to_check,
    proxies={})
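In urllib2 itself (the module the question uses), the same bypass can be done with an empty `ProxyHandler`. A sketch in Python 3 syntax, where urllib2 became `urllib.request` (the actual network call is omitted):

```python
import urllib.request

# An empty ProxyHandler ignores any proxies picked up from the
# environment; install_opener() makes this opener the default
# for subsequent urlopen() calls.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
urllib.request.install_opener(opener)
```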
Here is an example of combining a URL with words from a wordlist. It combines the list words to fetch images from the resulting URL and download them. Loop over it to cover your entire list.
import urllib
import re

print "The URL crawler starts.."
mylist = ["http://www.ebay", "https://www.npmjs.org/"]
wordlist = [".com", "asss"]
x = 1

urlcontent = urllib.urlopen(mylist[0] + wordlist[0]).read()
imgUrls = re.findall('img .*?src="(.*?)"', urlcontent)

for imgUrl in imgUrls:
    img = imgUrl
    print img
    urllib.urlretrieve(img, str(x) + ".jpg")
    x = x + 1
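To cover the whole list rather than just `mylist[0] + wordlist[0]`, the combinations can be generated with `itertools.product`. A sketch in Python 3 syntax, reusing the lists from the snippet above:

```python
from itertools import product

mylist = ["http://www.ebay", "https://www.npmjs.org/"]
wordlist = [".com", "asss"]

# product() yields every (base, word) pair, so each combined URL
# can be fed to the download logic above in turn
for base, word in product(mylist, wordlist):
    print(base + word)
```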
Hope this helps; otherwise post your code and the error log.
I solved the problem. I'm now just using urllib instead of urllib2 and everything works. Thanks everyone :)