Problem while using re.search in URL detection in Python
I am using re.search to replace the URLs in a string with the placeholder "{url_object}". Here is my code:
import re

def url_detector(text):
    urls = re.findall(r"(https?\:\/\/[\w\d:#@%\/;$()~_?\+-=\\.&]*)", text)
    if len(urls) > 0:
        for url in urls:
            span = re.search(url, text).span()
            text = text[:(span[0])] + '{url_object}' + text[span[1]:]
    return text
The texts containing URLs that I used are as follows:
text_list = ["google url is https://www.google.com/. Everyone frequently uses it",
"https://www.google.com/search?q=simple+search&oq=simple+search&aqs=chrome..69i57j0l9.2908j0j7&sourceid=chrome&ie=UTF-8 is the url for simplesearch",
"url for today's news : https://www.google.com/search?q=news+today&sxsrf=ALeKk00r1fVK6JeIaO1bhigZSu8IEGjgQw%3A1617353154494&ei=wtlmYIrXHdXez7sP6v-XwAw&oq=news+today&gs_lcp=Cgdnd3Mtd2l6EAMyCggAELEDEIMBEEMyCAgAELEDEIMBMggIABCxAxCDATIICAAQsQMQgwEyCAgAELEDEIMBMggIABCxAxCDATIICAAQsQMQgwEyCAgAELEDEIMBMggIABCxAxCDATICCAA6BwgAEEcQsAM6BwgAELADEEM6CgguELADEMgDEEM6BAgAEEM6BwgAELEDEEM6BQgAELEDSgUIOBIBMVDUBliRFGCzGGgBcAJ4AIABnAGIAacHkgEDMC43mAEAoAEBqgEHZ3dzLXdpesgBC8ABAQ&sclient=gws-wiz&ved=0ahUKEwiKwP-Blt_vAhVV73MBHer_BcgQ4dUDCA0&uact=5, date = 02/04/2021",
"sample url = https://www.google.com/search?q=sample&sxsrf=ALeKk02uixAiZMyqhMtSZZwbeYefHRutGQ%3A1617353222151&ei=BtpmYKfTCJKD4-EPlLWVeA&oq=sample&gs_lcp=Cgdnd3Mtd2l6EAMyBAgjECcyBQgAELEDMgUIABCxAzIECAAQQzIFCAAQsQMyBAgAEEMyAggAMgUIABCxAzIFCAAQsQMyBQgAELEDOgcIABBHELADOgcIABCwAxBDOgcIABCHAhAUUKERWN4UYMIYaAFwAngAgAGWAYgB9ASSAQMwLjWYAQCgAQGqAQdnd3Mtd2l6yAEKwAEB&sclient=gws-wiz&ved=0ahUKEwin7qCilt_vAhWSwTgGHZRaBQ8Q4dUDCA0&uact=5"]
I ran url_detector on the list above:
for text in text_list:
    print(url_detector(text))
I expected the output to look like this:
google url is {url_object}. Everyone frequently uses it
{url_object} is the url for simple search
url for today's news : {url_object}, date = 02/04/2021
sample url = {url_object}
But what I got was:
google url is {url_object}. Everyone frequently uses it
'NoneType' object has no attribute 'span'
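This seems to happen because of the '?' in the URLs returned by re.findall. It is probably because re treats '?' as a character with special meaning. So I tried replacing '?' with '\?' to make it work, but '?' gets replaced with '\\?' instead. When that pattern is used with re.search(), it raises this error: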
error: bad escape (end of pattern) at position 29
Any ideas on how to solve this? Thanks in advance.
Try wrapping url in re.escape() in the statement span = re.search(url, text).span(), like this:
span = re.search(re.escape(url), text).span()
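The reason is that the results extracted by the first re.findall() contain characters, such as ?, that the regex engine treats as special regex tokens. So when you later pass an already-matched result to re.search() as a pattern, those characters are interpreted as metacharacters rather than as literal text, and the search fails to match (re.search() returns None, which is why .span() raises the 'NoneType' error).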
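The difference can be seen in a small sketch. The text and URL here are a shortened illustration based on the question's data, not the full strings from the list:

```python
import re

text = "see https://www.google.com/search?q=simple+search for details"
url = "https://www.google.com/search?q=simple+search"

# Used as a pattern, "h?" makes the "h" optional and "e+" means one or more "e",
# so the literal "?" and "+" in the text are never matched.
unescaped = re.search(url, text)
print(unescaped)

# re.escape() backslash-escapes the metacharacters, so the URL is matched literally.
escaped = re.search(re.escape(url), text)
print(escaped.group(0))
```

The first search finds nothing, while the second one returns a match covering the URL exactly as it appears in the text.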
From the Python documentation for re.escape():
Escape special characters in pattern. This is useful if you want to
match an arbitrary literal string that may have regular expression
metacharacters in it.
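As an alternative to finding the URLs first and searching for each one again, the replacement can be done in a single pass with re.sub(), which avoids the escaping issue altogether. This is only a sketch under assumptions: the pattern below is a simplified one (not the character class from the question), and the function name replace_urls and the trailing-punctuation handling are illustrative choices.

```python
import re

# Assumption: a simplified pattern, not the original character class.
# It matches "http://" or "https://" followed by any non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def replace_urls(text, placeholder="{url_object}"):
    """Replace every URL in text with the placeholder in one pass."""
    def _swap(match):
        url = match.group(0)
        # Keep trailing sentence punctuation (e.g. "." or ",") outside the placeholder.
        stripped = url.rstrip(".,;:!?")
        return placeholder + url[len(stripped):]
    return URL_PATTERN.sub(_swap, text)

print(replace_urls("google url is https://www.google.com/. Everyone frequently uses it"))
```

Because the replacement is computed from the match object itself, the matched URL is never reused as a pattern, so no call to re.escape() is needed.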