How to choose first match from Alternation regex?

Question

I'm trying to extract all the text in a tweet that comes before a URL starting with "https:...".

Tweet example:

"This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"

In this example, I'd like to remove "https://... (Video via @QuickTake)" and keep only the text from the beginning. It should also work for tweets that don't contain any URL link.

I tried this expression, and got two matches when the tweet contains a URL:

/(.*)(?=\shttps.*)|(.*)

How can I make it retrieve only the text of the tweet?

Thanks in advance!

Answer 1

This may be overly simplistic, but a plain str.find might do the trick:

>>> s = "This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
>>> s[:s.find('https://')]
'This traditional hairdo is back in fashion thanks to the coronavirus, and Kenyans are using it to raise awareness '

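You're basically just slicing the tweet up to the first occurrence of https://.

Note that this alone won't work when https:// doesn't appear in the tweet. When https:// is not found, s.find('https://') returns -1, which messes up the slice: s[:-1] drops the last character of the tweet. If it's not found, just set the index (link_index below) to the length of the full tweet: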

>>> s = 'this is some tweet without a URL'
>>> link_index = s.find('https://')
>>> if link_index == -1:
...     link_index = len(s)
... 
>>> s[:link_index]
'this is some tweet without a URL'
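
As a possible variation on the same idea, str.partition avoids the -1 check altogether, because it returns the whole string as the first element when the separator is missing. The sketch below is only an illustration: the helper name text_before_url and the trailing rstrip() are my own additions, not part of the approach above, and the sample tweet is shortened.

```python
def text_before_url(tweet, marker="https://"):
    """Return the part of `tweet` before the first `marker` (hypothetical helper)."""
    # str.partition returns (head, sep, tail). When the marker is absent,
    # head is the entire string, so no special case for "not found" is needed.
    head, _, _ = tweet.partition(marker)
    # Assumption: the trailing space before the URL is unwanted, so strip it.
    return head.rstrip()


with_url = "Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
without_url = "this is some tweet without a URL"

print(text_before_url(with_url))
print(text_before_url(without_url))
```

Drop the rstrip() call if you want to keep the text exactly as sliced, including the space before the link.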

Answer 2

You can remove https and everything after it up to the end of the string using

tweet = re.sub(r'\s*https.*', '', tweet)

Details:

\s* - 0+ whitespace characters
https - a literal string
.* - the rest of the string (line).
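
For completeness, here is a small self-contained sketch of the substitution applied to a tweet with a link and one without. The names URL_TAIL, URL_TAIL_MULTILINE and strip_url are made up for this example, and the sample tweet is shortened. The second pattern is an optional assumption on my part: by default . does not match a newline, so in a multi-line tweet only the rest of the current line is removed, and re.DOTALL is needed if everything after the link should go.

```python
import re

# Same pattern as above, compiled once for reuse.
URL_TAIL = re.compile(r'\s*https.*')
# Variant where '.' also matches newlines, so text on later lines is removed too.
URL_TAIL_MULTILINE = re.compile(r'\s*https.*', re.DOTALL)


def strip_url(tweet, pattern=URL_TAIL):
    """Remove the first 'https' and whatever the pattern matches after it."""
    # When nothing matches, re.sub returns the input string unchanged.
    return pattern.sub('', tweet)


with_url = "Kenyans are using it to raise awareness https://... (Video via @QuickTake)"
without_url = "this is some tweet without a URL"
two_lines = "first line https://...\nsecond line"

print(strip_url(with_url))
print(strip_url(without_url))
print(repr(strip_url(two_lines)))
print(repr(strip_url(two_lines, URL_TAIL_MULTILINE)))
```

Because a tweet without a link simply produces no match, the same call covers both cases and no alternation is needed.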

python

regex

tweepy

tweets