Python error downloading image from web (HTTP Error 400: Bad Request)
We are new to Python and are currently trying to download images from Google Images for a specific keyword in order to build a dataset.
We found a good tutorial here, but we cannot get the code to work. We are using Python 3.6 on Windows 10 with Google Chrome 68.0.3440.84. The code is:
import os
import urllib.request as ulib
from bs4 import BeautifulSoup as Soup
import json

url_a = 'https://www.google.com/search?ei=1m7NWePfFYaGmQG51q7IBg&hl=en&q={}'
url_b = '\&tbm=isch&ved=0ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ&start={}'
url_c = '\&yv=2&vet=10ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ.1m7NWePfFYaGmQG51q7IBg'
url_d = '\.i&ijn=1&asearch=ichunk&async=_id:rg_s,_pms:s'
url_base = ''.join((url_a, url_b, url_c, url_d))

headers = {'User-Agent': 'Chrome/41.0.2228.0 Safari/537.36'}


def get_links(search_name):
    search_name = search_name.replace(' ', '+')
    url = url_base.format(search_name, 0)
    request = ulib.Request(url, None, headers)
    json_string = ulib.urlopen(request).read()
    page = json.loads(json_string)
    new_soup = Soup(page[1][1], 'lxml')
    images = new_soup.find_all('img')
    links = [image['src'] for image in images]
    return links


def save_images(links, search_name):
    directory = search_name.replace(' ', '_')
    if not os.path.isdir(directory):
        os.mkdir(directory)
    for i, link in enumerate(links):
        savepath = os.path.join(directory, '{:06}.png'.format(i))
        ulib.urlretrieve(link, savepath)


if __name__ == '__main__':
    search_name = 'referee'
    links = get_links(search_name)
    save_images(links, search_name)
We have already tried changing the user agent, but nothing changed. We then tried removing the url_d string from url_base, which changed the error to:
in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
If anyone has any suggestions or ideas, please let us know.
Thanks in advance!
You are not getting JSON data back from that request, which is why json.loads fails; the response can be parsed directly as HTML. Change get_links to:
def get_links(search_name):
    search_name = search_name.replace(' ', '+')
    url = url_base.format(search_name, 0)
    request = ulib.Request(url, None, headers)
    data = ulib.urlopen(request).read()
    new_soup = Soup(data, 'lxml')
    images = new_soup.find_all('img')
    links = [image['src'] for image in images]
    return links
After running your updated script:
(661) $ ls referee/
000000.png 000002.png 000004.png 000006.png 000008.png 000010.png 000012.png 000014.png 000016.png 000018.png
000001.png 000003.png 000005.png 000007.png 000009.png 000011.png 000013.png 000015.png 000017.png 000019.png
(662) $ file referee/000000.png
referee/000000.png: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 145x97, frames 3
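Note that the `file` output shows the downloaded files are actually JPEGs even though the script names them .png. Below is a minimal sketch (not part of the original answer) of a save_images variant that picks the extension from the response's Content-Type header; the EXTENSIONS mapping and the .jpg fallback are illustrative assumptions.

import os
import urllib.request as ulib

# Map common image Content-Types to file extensions (assumed mapping).
EXTENSIONS = {
    'image/jpeg': '.jpg',
    'image/png': '.png',
    'image/gif': '.gif',
}


def save_images(links, search_name, headers=None):
    directory = search_name.replace(' ', '_')
    if not os.path.isdir(directory):
        os.mkdir(directory)
    for i, link in enumerate(links):
        request = ulib.Request(link, None, headers or {})
        with ulib.urlopen(request) as response:
            # Choose the extension from the reported Content-Type,
            # falling back to .jpg when the type is not in the mapping.
            ext = EXTENSIONS.get(response.headers.get_content_type(), '.jpg')
            savepath = os.path.join(directory, '{:06}{}'.format(i, ext))
            with open(savepath, 'wb') as f:
                f.write(response.read())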