The fastest way to get a URL inside a string

Question

I have to check thousands of strings, and I need to extract every complete URL that contains instagram.com/p/

So far I have been using this approach:

import re

msg = 'hello there http://instagram.com/p/BvluRHRhN16/'
msg = re.findall(
    r'http[s]?://?[\w/\-?=%.]+instagram.com/p/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
    msg)
print(msg)

But it fails to find some URLs.

I want to get all URLs of the following forms:

https://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
http://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
www.instagram.com/p/BvluRHRhN16/

What is the fastest way to get this result?

Answer 1

I am assuming the input is a list of sentences that contain URLs. Hope this helps.

import re

msg = ['hello there http://google.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.com/p/BvluRHRhN16/',
       'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
      ]

for m in msg:
    # Match from "http" or "www" through an instagram "/p" path;
    # the final ".+" is greedy, so the match runs to the end of the string.
    ms = re.findall(r'(http.*instagram.+\/p.+|www.*instagram.+\/p.+)', m)
    print(ms)

Edited regex (restricted to instagram.com, ending at the last slash):

ms = re.findall(r'(http.*instagram\.com\/p.+\/|www.*instagram\.com\/p.+\/)', m)
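
Because the edited pattern is still greedy, two URLs in the same sentence would be merged into one match. A tighter alternative is sketched below. It is not the answerer's pattern, and it assumes that a post shortcode consists only of letters, digits, underscores, and hyphens. The names INSTAGRAM_POST_RE and find_instagram_post_urls are made up for this illustration. Compiling the pattern once avoids looking it up again for each of the thousands of strings.

```python
import re

# Assumption: a shortcode is made of letters, digits, "_" and "-".
# The lookbehind rejects hosts that merely end in "instagram.com"
# (e.g. "fakeinstagram.com") and any subdomain other than "www.".
INSTAGRAM_POST_RE = re.compile(
    r'(?<![\w.-])(?:https?://)?(?:www\.)?instagram\.com/p/[\w-]+/?'
)


def find_instagram_post_urls(text):
    """Return every instagram.com/p/ URL found in text, in order."""
    return INSTAGRAM_POST_RE.findall(text)


if __name__ == "__main__":
    samples = [
        'hello there http://google.com/p/BvluRHRhN16/ this is a test',
        'hello there https://www.instagram.com/p/BvluRHRhN16/',
        'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
        'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test',
    ]
    for sample in samples:
        print(find_instagram_post_urls(sample))
```

With the four sample sentences, only the second and third produce a match, and the trailing "this is a test" is not included.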

Answer 2

# urlextract is a third-party package: pip install urlextract
from urlextract import URLExtract

url = '''
'hello there http://google.com/p/BvluRHRhN16/ this is a test',
      'hello there https://www.instagram.com/p/BvluRHRhN16/',
      'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
      'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
'''

# Extract every URL from the text, whatever its domain
extractor = URLExtract()
urls = extractor.find_urls(url)
print(urls)

Output: ['http://google.com/p/BvluRHRhN16/', 'https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/', 'https://www.instagram.net/p/BvluRHRhN16/']

Edit: filtering the URLs down to Instagram posts

filtered = [item for item in urls if "instagram.com/p/" in item]

print(filtered)

Output: ['https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/']
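
If installing a third-party package is not an option, the same substring filter can be applied with the standard library alone. The sketch below uses a different technique from urlextract: it splits each string on whitespace and keeps the tokens that contain the substring. It assumes that URLs never contain spaces. The names NEEDLE and extract_post_urls are made up for this illustration. The plain `in` test lets most of the thousands of strings be skipped without any further work.

```python
NEEDLE = "instagram.com/p/"


def extract_post_urls(strings):
    """Return whitespace-delimited tokens that contain NEEDLE."""
    urls = []
    for text in strings:
        if NEEDLE not in text:  # cheap substring test skips unrelated strings
            continue
        for token in text.split():
            if NEEDLE in token:
                # Drop quotes or punctuation that may cling to the token.
                urls.append(token.strip("'\",.;()"))
    return urls


if __name__ == "__main__":
    msg = [
        'hello there http://google.com/p/BvluRHRhN16/ this is a test',
        'hello there https://www.instagram.com/p/BvluRHRhN16/',
        'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
        'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test',
    ]
    print(extract_post_urls(msg))
```

Note that a bare substring test also accepts hosts such as fakeinstagram.com, so this trades strictness for simplicity.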
