How do I scrape images using Python while ignoring the height and width in the URL?

Question

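I'm trying to write a Python script to download images from an API.
The API returns image URLs in the following format:

https://whosebug.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480

Each image URL is on its own line. I'm trying to use urllib, but I'm struggling to work out how to drop the width/height when processing each jpg, because I want the full-size image rather than the 640x480 one.

This is what I have been testing: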

import urllib
import re

input_file = open('imgurls.txt','r')
x=0
for line in input_file:
    URL= line

    urllib.urlretrieve(URL, str(x) + ".jpg")
    x+=1

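I'm not sure how to deal with the width/height part.
I believe I should use rsplit, but I'm not sure.
I also need to skip to the next line when the line being read is not a URL, to avoid errors.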

Answer 1

You can split off the last two query parameters from the URL and then join the rest back together.

url = 'https://whosebug.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480'
full_img_url = '&'.join(url.split('&')[:-2])

# 'https://whosebug.com/media/GetImage?ID=98383838&imageName=03833883.jpg'

This assumes that width and height are always the last two parameters.
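
If the parameter order cannot be relied on, a variant that filters by name rather than by position may be safer. The following is only a sketch: `drop_size_params` is a hypothetical helper name, and it assumes the URL has no `#fragment` and that the server returns the full-size image when `width` and `height` are omitted.

```python
def drop_size_params(url, names=("width", "height")):
    """Remove the named query parameters wherever they appear (hypothetical helper)."""
    base, sep, query = url.partition("?")
    if not sep:
        return url  # no query string, so there is nothing to strip
    # Keep every key=value pair whose key is not in `names`
    kept = [pair for pair in query.split("&")
            if pair.split("=", 1)[0] not in names]
    return base + "?" + "&".join(kept) if kept else base


url = "https://whosebug.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480"
print(drop_size_params(url))
```

Because it only uses string methods, this should behave the same under Python 2 and Python 3.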

Answer 2

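cricket_007's answer looks great to me. A slightly more robust approach might be to use urlparse to break the URL apart, remove the unwanted query parameters, and rebuild it: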

import urllib    # Python 2; needed for urllib.urlencode below
import urlparse

url = 'https://whosebug.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480'
parsed = urlparse.urlparse(url)
query = parsed.query
# parse_qs maps each parameter name to a list of values
parsed_query = urlparse.parse_qs(query)
parsed_query.pop('width', None)
parsed_query.pop('height', None)
# doseq=True turns the value lists back into key=value pairs
result = urlparse.urlunparse((parsed.scheme, parsed.netloc, parsed.path, parsed.params, urllib.urlencode(parsed_query, True), parsed.fragment))
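
For Python 3, where these functions live in `urllib.parse`, the same idea can be sketched as follows. It also covers the "skip lines that are not URLs" part of the question. The names `strip_size` and `image_urls` are hypothetical, the `http://`/`https://` prefix check is an assumed definition of "is a URL", and `urlencode` may normalize percent-encoding in the remaining values.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_size(url, drop=("width", "height")):
    """Return url without the query parameters named in drop (hypothetical helper)."""
    parts = urlsplit(url)
    # parse_qsl keeps the original parameter order as (key, value) pairs
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def image_urls(lines):
    """Yield cleaned URLs, skipping blank lines and lines that are not http(s) URLs."""
    for line in lines:
        candidate = line.strip()  # also removes the trailing newline
        if candidate.startswith(("http://", "https://")):
            yield strip_size(candidate)


sample = [
    "https://whosebug.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480\n",
    "not a url\n",
    "\n",
]
for i, url in enumerate(image_urls(sample)):
    print(i, url)
    # To download: urllib.request.urlretrieve(url, "{}.jpg".format(i))
```

To run it against the file from the question, `sample` could be replaced with `open('imgurls.txt')`; the download call is left as a comment so the sketch makes no network requests.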

python

url

urllib