Use a dynamic int variable inside a regex pattern in Python

Question

I am in my first few days of learning Python, so apologies if this question has already been asked.

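I am posting here because the existing answers did not help me. My requirement is to read a file and print all the URLs in it. Inside a for loop, the regex pattern I used is [^https://][\w\W]*, and it worked fine. But I would like to know whether I can dynamically pass the length of the part of the line that follows https:// and get the output with that number of occurrences instead of *.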

I tried [^https://][\w\W]{var}} where var=len(line)-len(https://)

These are the other patterns I tried:

pattern = '[^https://][\w\W]{'+str(int(var))+'}'

pattern = r'[^https://][\w\W]{{}}'.format(var)

pattern = r'[^https://][\w\W]{%s}'%var

Answer 1

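I may be misunderstanding your question, but if you know that the URL always starts with https://, then that prefix is the first eight characters. You can then get the length after finding the URLs: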

# Example list of URLs - you should fill it from your for loop
list_urls = ['https://whosebug.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://whosebug.com']
for url in list_urls:
    print(url[8:])

Out:

whosebug.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
whosebug.com
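The 8 in the slice above is simply len('https://'). A minimal sketch that derives the offset from the prefix instead of hard-coding it, and leaves strings without the prefix untouched. The names PREFIX and strip_scheme and the sample inputs are illustrative assumptions, not part of the original answer.

```python
PREFIX = "https://"


def strip_scheme(url, prefix=PREFIX):
    """Return url without the leading prefix; return it unchanged if the prefix is absent."""
    if url.startswith(prefix):
        # len(prefix) is 8 for "https://", the same offset as url[8:]
        return url[len(prefix):]
    return url


stripped = [strip_scheme(u) for u in ["https://google.com", "ftp://example.org"]]
print(stripped)
```

The length the question asks about is then just len(strip_scheme(url)).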

You can use re.findall to find all the URLs instead of a for loop:

import re

# Raw string so the backslashes reach the regex engine unchanged
url_pattern = r"((https://)([\w-]+\.)+[\w-]+([\w%/~+#.-]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)

# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))

# print the urls
print(unique_urls)
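Since the question starts from a file, here is a hedged sketch of reading a file and passing its contents to re.findall. The function name find_urls, the simplified pattern https://\S+ (which assumes URLs contain no whitespace) and the temporary sample file are all assumptions made for illustration.

```python
import os
import re
import tempfile

# Simplified assumption: a URL is "https://" followed by non-whitespace characters
URL_RE = re.compile(r"https://\S+")


def find_urls(path):
    """Read the whole file and return its unique URLs in sorted order."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    return sorted(set(URL_RE.findall(text)))


# Build a small throwaway file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("see https://google.com and\nhttps://example.org/page or https://google.com\n")
    sample_path = tmp.name

found = find_urls(sample_path)
os.remove(sample_path)
print(found)
```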

Answer 2

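In your pattern you use [^https://], which is a negated character class: [^ starts a set that matches any single character except the ones listed (here h, t, p, s, : and /). It does not match the literal string https://.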
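A small sketch of what the negated class actually does with the pattern from the question. The sample URLs are assumptions chosen to show one case that appears to work and one that does not.

```python
import re

question_pattern = r"[^https://][\w\W]*"

# The match starts at the first character that is NOT one of h, t, p, s, :, /
# "w" is not in the set, so the result happens to look correct
first = re.search(question_pattern, "https://www.test.com").group()

# "t" is in the set, so the match only starts at "e" and the host loses a letter
second = re.search(question_pattern, "https://test.com").group()

print(first)
print(second)
```

This is why the pattern can seem fine on some inputs while silently dropping leading characters on others.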

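One option is to use literal string interpolation (an f-string). Assuming your links do not contain spaces, you can use \S instead of [\w\W], because the latter matches any character, including spaces and newlines.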

\bhttps://\S{{{var}}}(?!\S)

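The assertion (?!\S) at the end is a whitespace boundary on the right that prevents partial matches, and the word boundary \b prevents https from being part of a larger word.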
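To see what the two boundaries contribute, here is a minimal comparison of the pattern with and without them. The sample text and the length n = 4 are assumptions for illustration.

```python
import re

text = "https://abcd https://abcdefgh xhttps://abcd"
n = 4

# Without boundaries: also matches the first part of a longer URL
# and an "https" that is glued to a preceding word character
loose = rf"https://\S{{{n}}}"

# With \b on the left and (?!\S) on the right: exact-length URLs only
strict = rf"\bhttps://\S{{{n}}}(?!\S)"

loose_hits = re.findall(loose, text)
strict_hits = re.findall(strict, text)
print(loose_hits)
print(strict_hits)
```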

For example:

import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"

var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"

print(re.findall(pattern, lines))

Output

['https://www.test.com', 'https://thisisatestt']
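The quantifier can also be built with the other formatting styles attempted in the question. The sketch below, using an assumed value var = 12, shows how each style has to treat the braces: str.format and f-strings need a literal brace doubled, while concatenation and % formatting do not.

```python
var = 12

by_concat = r"\bhttps://\S{" + str(var) + r"}(?!\S)"
by_percent = r"\bhttps://\S{%d}(?!\S)" % var
# "{{" and "}}" are literal braces, the inner "{}" is the placeholder
by_format = r"\bhttps://\S{{{}}}(?!\S)".format(var)
by_fstring = rf"\bhttps://\S{{{var}}}(?!\S)"

# Two pairs of braces, as in the question, leave no placeholder at all:
# the result is the literal text "{}" and var is ignored
no_placeholder = "{{}}".format(var)

print(by_concat)
print(no_placeholder)
```

All four variants produce the same pattern string, so the choice between them is a matter of style.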
