How to get the first sentence before a defined string with Regex

Question

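I'm doing some scraping and want to grab a specific part of a srcset attribute, but I'm not sure how to do this with a regex. Is there a regex expert here who can help me?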

srcset="https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"

I want the first URL before 1200w. So the result should be:

https://cimg.co/w/articles/4/5ca/f022bb06dc.png

Why I need a regex rather than just taking the last element:

Thanks in advance, and have a nice weekend :)

Answer 1

No regex is needed. You can use the string methods split and partition:

In [181]: srcset = "https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"

In [182]: def get_url(srcset): 
     ...:     for str_ in srcset.split(','): 
     ...:         url, _, ext = str_.strip().partition(' ') 
     ...:         if ext == '1200w': 
     ...:             return url 
     ...:                                                                                                                                                                                                   

In [183]: get_url(srcset)                                                                                                                                                                                   
Out[183]: 'https://cimg.co/w/articles/4/5ca/f022bb06dc.png'

This assumes that a comma (,) never appears inside a URL.
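If more than one width is of interest, the same split/partition idea can be extended to build a lookup table. The following is only a minimal sketch under the same no-comma-in-URL assumption; the name parse_srcset is made up for illustration and is not part of any library.

```python
def parse_srcset(srcset):
    """Map each descriptor (e.g. '1200w') to the URL that precedes it."""
    candidates = {}
    for entry in srcset.split(','):
        # partition splits on the first space only: URL, separator, descriptor
        url, _, descriptor = entry.strip().partition(' ')
        if url:  # skip empty entries such as a trailing comma
            candidates[descriptor.strip()] = url
    return candidates


srcset = (
    "https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, "
    "https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, "
    "https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, "
    "https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"
)

candidates = parse_srcset(srcset)
print(candidates.get('1200w'))
```

Using dict.get means a missing width yields None instead of raising, which is usually easier to handle while scraping.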

If you must use a regex, you could do:

https?://\S+(?=\s+1200w\b)

So, after import re:

In [184]: re.search(r'https?://\S+(?=\s+1200w\b)', srcset).group()                                                                                                                                          
Out[184]: 'https://cimg.co/w/articles/4/5ca/f022bb06dc.png'

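https?://\S+ matches the URL
The zero-width positive lookahead (?=\s+1200w\b) ensures the URL is followed by one or more whitespace characters (\s+) and then 1200w ending at a word boundary (\b)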
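Because the lookahead only checks what follows and consumes nothing, the match itself is just the URL. Here is a rough sketch that wraps the same pattern in a helper so the descriptor can vary; url_before is a hypothetical name, and re.escape is used on the assumption that the descriptor might contain regex metacharacters.

```python
import re


def url_before(srcset, descriptor):
    """Return the URL directly followed by the given descriptor, or None."""
    # re.escape keeps characters such as '.' in the descriptor literal
    pattern = r'https?://\S+(?=\s+' + re.escape(descriptor) + r'\b)'
    match = re.search(pattern, srcset)
    return match.group() if match else None


srcset = (
    "https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, "
    "https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, "
    "https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, "
    "https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"
)

print(url_before(srcset, '1200w'))
print(url_before(srcset, '600w'))
```

Since \s+ must sit between the URL and the descriptor, asking for 200w does not accidentally match the tail of 1200w.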

OTOH, if you don't like matching on the HTTP scheme, you can match the start of the string or a comma instead, and take the first captured group:

In [185]: re.search(r'(?:^|,\s+)(\S+)\s+1200w\b', srcset).group(1)                                                                                                                                          
Out[185]: 'https://cimg.co/w/articles/4/5ca/f022bb06dc.png'

Answer 2

Alternatively:

a = 'srcset="https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"'

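# Strip the attribute name and the quotes, then split into individual candidates
a = a.replace('srcset=', '').replace('"', '').split(',')
# Assumes the 1200w entry is the last candidate in the list
done = a[-1].strip().split(' ')[0]
print(done)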

Answer 3

You can use this regex:

[^\s,"]+(?=\s+1200w\b)
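For completeness, here is a small sketch of how that pattern might be applied to the raw attribute text. The character class excludes whitespace, commas and double quotes, so the leading srcset=" can never end up in the match. The variable names are only illustrative.

```python
import re

attribute = (
    'srcset="https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, '
    'https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, '
    'https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, '
    'https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"'
)

# A run of characters that are not whitespace, a comma or a double quote,
# accepted only when followed by whitespace and the 1200w descriptor
match = re.search(r'[^\s,"]+(?=\s+1200w\b)', attribute)
url = match.group() if match else None
print(url)
```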

Answer 4

Search with r"600w, (.*) 1200w"; group 1 should return the URL you are looking for.
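A quick sketch of that search in use. Note that it relies on the 600w entry coming immediately before the 1200w one; the reordered string below is a made-up input, included only to show what happens when that assumption does not hold.

```python
import re

srcset = (
    "https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, "
    "https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, "
    "https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, "
    "https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"
)

match = re.search(r"600w, (.*) 1200w", srcset)
url = match.group(1) if match else None
print(url)

# Hypothetical input with the entries in a different order
reordered = (
    "https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w, "
    "https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w"
)
no_match = re.search(r"600w, (.*) 1200w", reordered)
print(no_match)  # nothing follows "600w" here, so the search finds no match
```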

Answer 5

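The pattern .+?(?=1200w) matches any character except a newline one or more times, as few as possible, until what follows on the right is 1200w. Since the search starts at the beginning of the string, it returns everything before 1200w, not just the last URL.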
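To see what that pattern does on the sample data, a short check like the following may help. It is only an illustration; the variable names are arbitrary.

```python
import re

srcset = (
    "https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, "
    "https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, "
    "https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, "
    "https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w"
)

match = re.search(r'.+?(?=1200w)', srcset)
matched_text = match.group()

# The lazy quantifier still has to start at the first character, so the
# match is the whole prefix of the string up to (not including) "1200w"
prefix = srcset[:srcset.index('1200w')]
print(matched_text == prefix)
print(len(matched_text.split(',')))  # number of comma-separated entries covered
```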

To get a more specific match with a regex, you can use a capturing group:

\bsrcset="[^"]* (https?://\S+)\s+1200w"

For example:

import re
regex = r'\bsrcset="[^"]* (https?://\S+)\s+1200w"'
test_str = """srcset=\"https://cimg.co/w/articles/1/5ca/f022bb06dc.png 150w, https://cimg.co/w/articles/2/5ca/f022bb06dc.png 300w, https://cimg.co/w/articles/3/5ca/f022bb06dc.png 600w, https://cimg.co/w/articles/4/5ca/f022bb06dc.png 1200w\""""

matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))

Result

https://cimg.co/w/articles/4/5ca/f022bb06dc.png

python

regex

screen-scraping