How to specify regex repetitions in Python
I have this string to process:
rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/.
I want to extract the words di/IN jogja/NNP buat/VBT malioboro/NNP from that sentence. This is my code so far:
import re

def entityExtractPreposition(text):
    # Match from a token tagged /IN through an /NNP tag, skipping any further /IN tokens
    return re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)

text = "rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/."
prepo = entityExtractPreposition(text)
print(prepo)
The result extracts too many words:
di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP
My expected result is:
di/IN jogja/NNP buat/VBT malioboro/NNP
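I have read in some references that there are rules for limiting repetition (in my case, of /NNP), such as *, + and ?. What is the best way to specify or limit the number of repetitions in a regex?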
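You have to do this in two passes. First find the chunk that runs from /IN to the last /NNP, then search within that chunk and keep only the part up to the second (or n-th) /NNP, for example: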
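import re

def extract(text, n=2):
    try:
        # Pass 1: greedy match from the first /IN token to the last /NNP token
        match = re.search(r'\w+/IN.*\w+/NNP', text).group()
        # Pass 2: take the n-th /NNP token in that chunk (or the last one if there are fewer)
        last_match = list(re.finditer(r'\w+/NNP', match))[:n][-1]
        return match[:last_match.end()]
    except AttributeError:
        # re.search returned None: no /IN ... /NNP chunk in the text
        return ''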
Example usage and output:
In [36]: extract(text, 1)
Out[36]: 'di/IN jogja/NNP'
In [37]: extract(text, 2)
Out[37]: 'di/IN jogja/NNP buat/VBT malioboro/NNP'
In [38]: extract(text, 3)
Out[38]: 'di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP'
In [39]: extract('nothing to see here')
Out[39]: ''
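To make the two passes easier to follow, here is a small sketch that prints the intermediate values for the sample sentence. The names `block` and `nnp_ends` are only illustrative and are not part of the answer above.

```python
import re

text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")

# Pass 1: '.*' is greedy, so the chunk runs from the first /IN token
# all the way to the LAST /NNP token in the string.
block = re.search(r'\w+/IN.*\w+/NNP', text).group()

# Pass 2: record where each /NNP token ends inside that chunk.
nnp_ends = [m.end() for m in re.finditer(r'\w+/NNP', block)]

# Slicing the chunk at the k-th end offset keeps exactly k /NNP tokens.
for k, end in enumerate(nnp_ends, start=1):
    print(k, block[:end])
```

Each printed line corresponds to what extract(text, k) returns for that k.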
The rule is: from the first /IN up to and including the second /NNP.
A pattern that implements this rule:
^.*?\b(\w+\/IN(?:.*?\w+\/NNP\b){2})
^.*? # Starting from the beginning, thus match only first
\b # A word boundary
( # Captured group
\w+\/IN # One or more word chars, then a slash, then 'IN'
(?: # A non-captured group
.*?\w+ # Anything, lazily matched, followed by one or more word chars
\/NNP\b # A slash, then 'NNP', then a word boundary
){2} # Exactly twice
) # End of captured group
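As a rough sketch of how this pattern could be used from Python, the helper below builds it with a configurable repetition count. The function name `extract_with_quantifier` and the parameter `n` are assumptions made for illustration; only the pattern itself comes from the answer.

```python
import re

def extract_with_quantifier(text, n=2):
    """Return the span from the first /IN token through exactly n /NNP tokens."""
    # Same pattern as above, with the {2} quantifier replaced by {n}.
    pattern = r'^.*?\b(\w+/IN(?:.*?\w+/NNP\b){' + str(n) + r'})'
    match = re.search(pattern, text)
    # Group 1 holds the wanted span; return '' when nothing matches.
    return match.group(1) if match else ''

text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")

print(extract_with_quantifier(text, 2))
```

Note that {n} means exactly n repetitions, so if fewer than n /NNP tokens follow the /IN token this sketch returns an empty string, whereas the two-pass extract function falls back to the last /NNP available.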