How can I remove the overflow characters after the target in a URL?

Question

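I'm scraping web pages, but I've run into an error I can't get past.

I have the following URL and need to remove every character after jpg. I tried a regular expression without success, and I don't want to count the overflow characters and cut them off by index, because that would not work for every image.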

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg

This is the code I've tried so far:

regex = re.sub('[^*\w]/+\w', '', image)

# and

sep = 'jpg/'
rest = image.split(sep, 1)[0]
print(rest)

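Neither attempt worked. I also looked at this question, but I couldn't find a solution there, because my URL contains odd and repeated characters.

The expected result looks like this: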

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg

Answer 1

You can split the URL on /, drop the last part, and rejoin the rest with /.

Something like:

# Input
i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(i_string)

# Approach 1
res = "/".join(i_string.split("/")[:-1])
print(res)


# Approach 2, using function
def remove_last_part(i_string: str) -> str:
    return "/".join(i_string.split("/")[:-1])


print(remove_last_part(i_string))

Result

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
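
One caveat with dropping the last part unconditionally: a URL that has no extra segment would lose its file name. Below is a minimal sketch of a guarded variant, assuming the images you want always end in .jpg; the function name drop_thumbnail_segment is made up for illustration.

```python
def drop_thumbnail_segment(url: str) -> str:
    """Drop the last path segment only if what remains still ends in .jpg."""
    # rpartition splits on the LAST "/" and returns (head, sep, tail).
    head, sep, _last = url.rpartition("/")
    if sep and head.lower().endswith(".jpg"):
        return head
    # Nothing to strip: return the URL unchanged.
    return url


i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(drop_thumbnail_segment(i_string))
```

With this guard, a URL that already ends in a single .jpg file name comes back untouched.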

Since you want to remove everything after jpg, you can do something like this:

# Input
i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(i_string)

def remove_last_part(i_string: str, delimiter: str) -> str:
    return delimiter.join(i_string.split(delimiter)[:1]) + delimiter


print(remove_last_part(i_string, "jpg"))

Result:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg
https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg
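
Note that the function above appends the delimiter even when it never occurs in the input. If that matters for your data, str.partition may be a simpler fit. A small sketch, where keep_through_first and its default marker ".jpg" are names chosen here for illustration:

```python
def keep_through_first(url: str, marker: str = ".jpg") -> str:
    """Cut the URL right after the first occurrence of marker.

    If marker is absent, partition returns (url, "", ""), so the
    URL comes back unchanged.
    """
    head, found, _tail = url.partition(marker)
    return head + found


i_string = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(keep_through_first(i_string))
```

Using ".jpg" with the dot rather than "jpg" makes it less likely to cut at a stray "jpg" inside a file name.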

Answer 2

This stops matching at the first occurrence of .jpg in the given URL.

import re

url = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
regex = r"^https?://[^\s/$.?#].[^\s]*?(?:\.jpg)"
result = re.match(regex, url)
print(result.group())

Output:

https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg

https://regex101.com/r/i6mu3z/1
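
If the scraped images are not all .jpg, the same idea can probably be extended to a list of extensions. This is only a sketch: the extension list, the IMAGE_EXT pattern, and the function name strip_after_image_ext are assumptions, so adjust them to the formats you actually encounter.

```python
import re

# Lazily match up to the first image extension that is followed by "/"
# or by the end of the string. The extension list is an assumption.
IMAGE_EXT = re.compile(r"^(.*?\.(?:jpe?g|png|gif|svg))(?:/|$)", re.IGNORECASE)


def strip_after_image_ext(url: str) -> str:
    """Return the URL up to and including the first image extension."""
    match = IMAGE_EXT.match(url)
    # Fall back to the original URL when no known extension is found.
    return match.group(1) if match else url


url = "https://upload.wikimedia.org/wikipedia/commons/2/25/Lebron_wizards_2017_%28cropped%29.jpg/498px-Lebron_wizards_2017_%28cropped%29.jpg"
print(strip_after_image_ext(url))
```

Requiring "/" or end-of-string after the extension avoids cutting in the middle of a name such as photo.jpgs, and returning the input on no match avoids the AttributeError that result.group() raises when re.match returns None.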

python

python-re