How to capture both lookahead and lookbehind in a Python regex
Here is a string:
str = "Academy \nADDITIONAL\nAwards and Recognition: Greek Man of the Year 2011 Stanford PanHellenic Community, American Delegate 2010 Global\nEngagement Summit, Honorary Speaker 2010 SELA Convention, Semi-Finalist 2010 Strauss Foundation Scholarship Program\nComputer Skills: Competency: MATLAB, MySQL/PHP, JavaScript, Objective-C, Git Proficiency: Adobe Creative Suite, Excel\n(highly advanced), PowerPoint, HTML5/CSS3\nLanguages: Fluent English, Advanced Spanish\n\x0c"
I want to capture everything from "ADDITIONAL" through "Languages", so I wrote this regex:
regex = r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'
But it only captures what lies in between, i.e. the ([\s\S]+) part. It does not capture ADDITIONAL and Languages themselves. What am I missing here?
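If you want them included in the match, don't put them in lookarounds: the purpose of a lookaround is to test the surrounding text without including it in the match result. If you only need the alternation, use plain non-capturing groups.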
regex = r'\n(?:ADDITIONAL|Additional)\n[\s\S]+?\n(?:Languages|LANGUAGES)\n*'
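By the way, your regex requires a newline on each side of ADDITIONAL and a newline before Languages, so it only matches when the string really contains those newline characters.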
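A minimal sketch of the difference, using a shortened stand-in for the string in the question. The names `text`, `with_lookarounds` and `with_groups` are made up for this illustration.

```python
import re

# Hypothetical, shortened stand-in for the resume text in the question.
text = "Academy \nADDITIONAL\nAwards: Speaker 2010\nLanguages: Fluent English\n"

# Lookarounds only test the surrounding text, so the keywords are left out.
with_lookarounds = re.search(
    r'(?<=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(?:Languages|LANGUAGES))',
    text)

# Plain non-capturing groups consume the keywords, so they are in the match.
with_groups = re.search(
    r'\n(?:ADDITIONAL|Additional)\n[\s\S]+?\n(?:Languages|LANGUAGES)',
    text)

print(repr(with_lookarounds.group(0)))  # 'Awards: Speaker 2010'
print(repr(with_groups.group(0)))       # '\nADDITIONAL\nAwards: Speaker 2010\nLanguages'
```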
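It is being captured, but it is not part of capture group 0, because group 0 contains only the consumed match, that is, the text that advances the current match position.
Assertions do not advance the position, so anything you capture inside an assertion will not be part of the overall match.
If, however, the assertion is followed by a consuming subexpression that matches the same text the assertion tested, that text does become part of the overall match.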
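If your current regex does not match your string, the likely cause is that the text has no newlines around the keywords. In that case, remove the \n references: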
(?<=
( ADDITIONAL | Additional ) # (1)
)
[\s\S]+?
(?=
( Languages | LANGUAGES ) # (2)
)
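As a rough illustration of the point about group 0, the sketch below runs the regex from the question against a shortened stand-in string that does contain the newlines (the name `text` and its contents are assumptions for the example). The keywords are available as groups 1 and 2 even though group 0 leaves them out.

```python
import re

# Hypothetical, shortened stand-in for the resume text in the question.
text = "Academy \nADDITIONAL\nAwards: Speaker 2010\nLanguages: Fluent English\n"

m = re.search(
    r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)',
    text)

print(repr(m.group(0)))  # 'Awards: Speaker 2010' (consumed text only)
print(m.group(1))        # ADDITIONAL (captured inside the lookbehind)
print(m.group(2))        # Languages (captured inside the lookahead)
```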
Try this:
(?<=ADDITIONAL\s).*?(?=\sLanguages)
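Explanation:
(?<=…) : positive lookbehind
\s : whitespace character (space, tab, newline, carriage return, form feed, vertical tab)
. : any character except a newline
* : zero or more times
? : placed after a quantifier such as *, makes it lazy (match as few characters as possible)
(?=…) : positive lookahead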
Python:
import re
p = re.compile(r'(?<=ADDITIONAL\s).*?(?=\sLanguages)', re.IGNORECASE)
test_str = "the companys direction ADDITIONAL Awards: 2010 Global Engagement Summit, Languages: Fluent Japanese"
g = re.findall(p, test_str)
print(g)  # ['Awards: 2010 Global Engagement Summit,']
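One caveat: `.` stops at line breaks, so when the section spans several lines, as in the question, this pattern finds nothing unless `re.DOTALL` is added. A small sketch, where `multiline_text` is a made-up sample:

```python
import re

# Hypothetical sample in which the section spans several lines.
multiline_text = "ADDITIONAL\nAwards: Speaker 2010\nSkills: Git\nLanguages: English"
pattern = r'(?<=ADDITIONAL\s).*?(?=\sLanguages)'

# Without DOTALL, "." cannot cross the newline between the two lines.
without_dotall = re.findall(pattern, multiline_text, re.IGNORECASE)

# With DOTALL, "." also matches "\n", so the whole section is found.
with_dotall = re.findall(pattern, multiline_text, re.IGNORECASE | re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['Awards: Speaker 2010\nSkills: Git']
```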
If you just need to capture the content including ADDITIONAL and LANGUAGES, use a simple regex like this:
\b(ADDITIONAL .* Languages)\b
Make sure to include the re.IGNORECASE flag when you use it in your solution.
See the demo on REGEX101.
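Keep in mind that `.*` is greedy, so if the word "languages" occurs more than once on the line, the match runs to the last occurrence. The sketch below compares it with a lazy `.*?` variant; the sample `line` is invented for the illustration.

```python
import re

# Hypothetical one-line sample containing "languages" twice.
line = "additional Awards: 2010 Summit Languages: English, other languages too"

# Greedy: runs to the last " languages" on the line.
greedy = re.search(r'\b(ADDITIONAL .* Languages)\b', line, re.IGNORECASE)

# Lazy: stops at the first " Languages".
lazy = re.search(r'\b(ADDITIONAL .*? Languages)\b', line, re.IGNORECASE)

print(greedy.group(1))
print(lazy.group(1))
```

With this sample, the greedy version ends at the second "languages" and the lazy one at the first.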
I think you are overcomplicating something simple:
match = re.search(r"(ADDITIONAL.*?Languages)", subject, re.DOTALL)
The re.DOTALL flag lets . match line breaks, which the text between the two keywords contains.
Regex explanation:
(ADDITIONAL.*?Languages)
Match the regex below and capture its match into backreference number 1 «(ADDITIONAL.*?Languages)»
Match the character string "ADDITIONAL" literally (case sensitive) «ADDITIONAL»
Match any single character, including line breaks because of re.DOTALL «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string "Languages" literally (case sensitive) «Languages»
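For text with line breaks, the choice of flag matters: `re.MULTILINE` only changes how `^` and `$` behave, while `re.DOTALL` is the flag that lets `.` match a newline. A short sketch on a made-up `subject` string:

```python
import re

# Hypothetical, shortened stand-in for the resume text in the question.
subject = "Academy \nADDITIONAL\nAwards: Speaker 2010\nLanguages: Fluent English\n"

# MULTILINE does not affect ".", so the match cannot cross the newlines.
multiline_match = re.search(r"(ADDITIONAL.*?Languages)", subject, re.MULTILINE)

# DOTALL lets "." match "\n", so the section is found.
dotall_match = re.search(r"(ADDITIONAL.*?Languages)", subject, re.DOTALL)

print(multiline_match)               # None
print(repr(dotall_match.group(1)))   # 'ADDITIONAL\nAwards: Speaker 2010\nLanguages'
```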
Your regex is
regex = r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'
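Your string is
Academy \nADDITIONAL\nAwards and Recognition: ... \nLanguages:
                      ^                           ^
                      |                           |
                      (?<=\n(ADDITIONAL|Additional)\n)
                                                  (?=\n(Languages|LANGUAGES)\n*)
The two markers show the positions at which the lookbehind and the lookahead succeed. So [\s\S]+? contains whatever lies between these two positions, which excludes ADDITIONAL and LANGUAGES.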
You just need to find the start position of ADDITIONAL and the end position of LANGUAGES. This can be done with the following regex:
(?=\n(ADDITIONAL|Additional)\n)([\s\S]+?)(?<=\n(Languages|LANGUAGES)\b)
Also, if you only want the overall match of [\s\S]+? and no separate groups for the keywords, you can use non-capturing groups for Additional and Languages:
(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)
Academy \nADDITIONAL\nAwards and Recognition: ... \nLanguages:
        ^                                                    ^
        |                                                    |
        (?=\n(ADDITIONAL|Additional)\n)
                                                             (?<=\n(Languages|LANGUAGES)\b)
Python code:
import re

# The pattern has no ^ or $ anchors, so no flags are needed.
p = re.compile(r'(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)')
test_str = "Academy \nADDITIONAL\nAwards and Recognition: Greek Man of the Year 2011 Stanford PanHellenic Community, American Delegate 2010 Global\nEngagement Summit, Honorary Speaker 2010 SELA Convention, Semi-Finalist 2010 Strauss Foundation Scholarship Program\nComputer Skills: Competency: MATLAB, MySQL/PHP, JavaScript, Objective-C, Git Proficiency: Adobe Creative Suite, Excel\n(highly advanced), PowerPoint, HTML5/CSS3\nLanguages: Fluent English, Advanced Spanish\n\x0c"
print(re.findall(p, test_str))
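Because the lookahead starts at the newline in front of ADDITIONAL, the match begins with that newline. If that is unwanted, it can be stripped afterwards. A small sketch on a shortened, made-up `sample` string:

```python
import re

p = re.compile(
    r'(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)')

# Hypothetical, shortened stand-in for the resume text in the question.
sample = "Academy \nADDITIONAL\nAwards: Speaker 2010\nLanguages: Fluent English\n"

m = p.search(sample)
# Drop the leading newline that the match starts with.
section = m.group(0).lstrip("\n")

print(repr(m.group(0)))  # '\nADDITIONAL\nAwards: Speaker 2010\nLanguages'
print(repr(section))     # 'ADDITIONAL\nAwards: Speaker 2010\nLanguages'
```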