Regex to separate URLs in text that has no separators

Question

Apologies for yet another regex question!

I have some input text that, unhelpfully, has multiple URLs (and only URLs) on a single line with no separators:

https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains only two URLs, but there could be more.

I am trying to use Python to separate the URLs into a list.

I have searched for solutions and tried a few, but could not get them to work, because they greedily consume all of the subsequent URLs.

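I realise this is probably because https://... may be legally allowed in the query part of a URL, but in my case I am willing to assume that it cannot be, and that when it occurs it is the start of the next URL.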

I also tried (http[s]://.*?), but with or without the ? it either matches the entire text or only https://

Answer 1

(https?:\/\/(?:(?!https?:\/\/).)*)

Try this. See the demo.

https://regex101.com/r/tX2bH4/15

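import re

# Match the scheme, then consume characters one at a time for as long as
# the next "http://" or "https://" does not start at the current position.
p = re.compile(r'(https?:\/\/(?:(?!https?:\/\/).)*)')
test_str = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"

# findall returns a list with one entry per URL.
print(p.findall(test_str))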
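
As a rough sketch of how the same pattern handles more than two URLs, it can be wrapped in a small helper. The function name `split_urls` and the example.com/.org/.net URLs are made up for illustration and are not from the question; the sketch also relies on the question's assumption that a new scheme always starts a new URL.

```python
import re

# Same idea as the pattern above: a scheme, then any run of characters
# in which no position begins another "http://" or "https://".
URL_RE = re.compile(r'https?://(?:(?!https?://).)*')


def split_urls(text):
    """Return the URLs in text, assuming a new scheme always starts a new URL."""
    return URL_RE.findall(text)


# Hypothetical input: three URLs run together, with a trailing newline.
joined = "https://example.com/a?x=1http://example.org/bhttps://example.net/c\n"
urls = split_urls(joined)
print(urls)
```

By default the dot does not match a newline, so the trailing `\n` is left out of the last URL.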

Answer 2

You need to use a positive lookahead assertion.

>>> import re
>>> s = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
>>> re.findall(r'https?://.*?(?=https?://|$|\s)', s)
['https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZg', 'https://console.developers.google.com/project/reducted/?authuser=1']
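
For comparison, here is a small sketch of why the attempts from the question behave the way they do, and what the lookahead changes. The input string is a shortened, made-up stand-in for the one in the question, and the variable names are only illustrative.

```python
import re

# Hypothetical shortened input: two URLs run together, trailing newline.
s = "https://example.com/a?x=1https://example.org/b\n"

# The question's lazy pattern: with nothing after it, .*? is satisfied
# by zero characters, so each match stops right after the scheme.
lazy_only = re.findall(r'(http[s]://.*?)', s)

# The greedy variant: .* runs to the end of the line and swallows
# the second URL along with the first.
greedy = re.findall(r'(http[s]://.*)', s)

# Lazy plus lookahead: .*? has to keep expanding until the next scheme,
# whitespace, or the end of the string comes next.
with_lookahead = re.findall(r'https?://.*?(?=https?://|$|\s)', s)

print(lazy_only)
print(greedy)
print(with_lookahead)
```

The lookahead gives the lazy quantifier something to stop at without consuming it, so the next match can begin at the following scheme.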

python

regex

url

findall