Why does my web scraper not convert the URLs from a CSV file properly for downloading?
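I am running a scraper that collects the URLs of all the images it can find on r/dankmemes on Reddit, stores them in a list, and then tries to download the files, but for some reason an error occurs. Could someone explain what I am doing wrong? I am new to Python.
The traceback points to line 38: urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
Error message: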
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "/Users/CENSORED/Desktop/FirstImages/scraper.py", line 38, in <module>
    urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 510, in open
    req = Request(fullurl, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 328, in __init__
    self.full_url = url
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 354, in full_url
    self._parse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'h'
The code I think is causing the problem:
with open('/Users/CENSORED/Desktop/FirstImages/file.csv') as images :
    images = csv.reader(images)
    img_count = 1
    for image in images:
        image = url.strip('\'"')
        urllib.parse.quote(':')
        urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
        img_count += 1
The text file:
['https://a.thumbs.redditmedia.com/JkyImC_zyl4XzE_yW-G4KOUTTFB6MRHUR3eEHvrpq64.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png', 'https://www.redditstatic.com/desktop2x/img/gold/badges/award-silver-cartoon.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/i6sdyng7n3h21.jpg?width=640&crop=smart&auto=webp&s=1abb4b30f2b74f114f2743cf66bf3d0e7f618abf',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://i.redd.it/m9q2841su3h21.jpg',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://i.redd.it/tsp8qpamc3h21.png',
'https://www.redditstatic.com/desktop2x/img/gold/badges/award-silver-cartoon.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://external-preview.redd.it/Ho2XSQOhaHGN3LhkLnPAf2OTkXwtuBTKQ9FXgdumH-I.jpg?width=640&crop=smart&auto=webp&s=54356f6b63ea9f51953f6a42d6c77fa4bf47df44',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/9j8389cno3h21.jpg?width=640&crop=smart&auto=webp&s=23c0ef3307b8b8ebdc7c4bcc3d16837ad58e460a',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/up1ouzug13h21.jpg?width=640&crop=smart&auto=webp&s=584bb8c90056156c3d2483d6f4b1030f7bf4e27d', 'https://a.thumbs.redditmedia.com/JkyImC_zyl4XzE_yW-G4KOUTTFB6MRHUR3eEHvrpq64.png',
'https://styles.redditmedia.com/t5_2zmfe/styles/image_widget_3xmxw4p2gqu01.png',
'https://b.thumbs.redditmedia.com/aRUO-zIbXgMTDVJOcxKjY8P6rGkakMdyVXn4k1VN-Mk.png', 'https://b.thumbs.redditmedia.com/iL0Rq5QLIS6xVLwoYKL8na6ZaSa9tILrBbhBlMfjVdI.png', 'https://b.thumbs.redditmedia.com/9aAIqRjSQwF2C7Xohx1u2Q8nAUqmUsHqdYtAlhQZsgE.png',
'https://b.thumbs.redditmedia.com/voAwqXNBDO4JwIODmO4HXXkUJbnVo_mL_bENHeagDNo.png']
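The `ValueError: unknown url type: 'h'` in the traceback above most likely comes from `image[0]`. If `image` is a string at that point (which is what `url.strip(...)` returns), indexing it yields only its first character. The snippet below is a minimal sketch of that effect; the variable names are illustrative, and it builds a `urllib.request.Request` directly, which is the step the traceback shows `urlretrieve` reaching internally, so no network access is involved.

```python
import urllib.request

url = "https://i.redd.it/m9q2841su3h21.jpg"

# Indexing a string returns a single character, not the whole URL.
first_char = url[0]
print(first_char)

# Request() validates the URL scheme as soon as it is constructed,
# so a one-character "URL" fails before anything is downloaded.
try:
    urllib.request.Request(first_char)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Passing the whole string (`image` rather than `image[0]`) avoids this particular error.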
Assuming the input file for the code below (urls.json) looks like this:
["https://i.redd.it/m9q2841su3h21.jpg",
"https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png",
"https://i.redd.it/tsp8qpamc3h21.png"]
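The code below downloads the images to c:\temp:
import json
import urllib.request as req

# urls.json holds a JSON array of image URLs.
with open('urls.json') as images:
    images = json.load(images)
    for idx, image_url in enumerate(images):
        image_url = image_url.strip()
        # Raw string, so that '\t' in the path is not read as a tab character.
        # The extension is whatever follows the last dot in the URL.
        file_name = r'c:\temp\{}.{}'.format(idx, image_url.split('.')[-1])
        print('About to download {} to file {}'.format(image_url, file_name))
        req.urlretrieve(image_url, file_name)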
Output:
About to download https://i.redd.it/m9q2841su3h21.jpg to file c:\temp\0.jpg
About to download https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png to file c:\temp\1.png
About to download https://i.redd.it/tsp8qpamc3h21.png to file c:\temp\2.png
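Two details of the file in the question differ from the sample input above: it uses single quotes, which `json.load` rejects, and some URLs carry a query string (`?width=640&...`), which would end up in the extension if the URL is split on the last dot. The following is a rough sketch of one way to handle both, assuming the file contains a valid Python list literal. The helper names `load_urls` and `target_name`, the `memes` folder, and the `.jpg` fallback are hypothetical choices for illustration, not part of the original code.

```python
import ast
import os
from urllib.parse import urlparse


def load_urls(text):
    """Parse text holding a Python list literal of URL strings."""
    urls = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    return [u.strip() for u in urls]


def target_name(index, url, folder):
    """Build a local file name, taking the extension from the URL path only."""
    path = urlparse(url).path  # the query string is not part of .path
    extension = os.path.splitext(path)[1] or ".jpg"  # assumed fallback extension
    return os.path.join(folder, "image_{}{}".format(index, extension))


sample = (
    "['https://i.redd.it/m9q2841su3h21.jpg', "
    "'https://preview.redd.it/i6sdyng7n3h21.jpg?width=640&crop=smart']"
)
urls = load_urls(sample)
names = [target_name(i, u, "memes") for i, u in enumerate(urls, start=1)]
print(names)
```

Each `(url, name)` pair could then be passed to `urllib.request.urlretrieve(url, name)`; the download itself is left out here so that the sketch runs without network access.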