Pattern of regular expressions while using Look Behind or Look Ahead Functions to find a match

Question

I am trying to split sentences correctly, according to normal grammar rules, in Python.

The text I want to split is

s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""

The expected output is

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.

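To achieve this I am using regular expressions, and after a lot of searching I came across the following regex, which does the trick. new_str is just s with some of the \n characters removed.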

import re

# new_str is s with some of the "\n" characters removed
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s',new_str)

for i in m:
    print(i)



Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.

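So the way I understand the regex is that we first select

1) all characters like i.e.

2) From the spaces filtered by the first selection, we select those that are not preceded by words like Mr., Mrs., etc.

3) From the result of the second step, we select only those that have either a dot or a question mark and are preceded by a space.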

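So I tried changing the order as follows:

1) Filter out all the titles first.

2) From the filtered result, select those that are preceded by a space.

3) Remove all phrases like i.e.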

But when I do that, the whitespace after i.e. is split on as well:

m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)',new_str)

for i in m:
    print (i)


Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.

Shouldn't the last step in the modified pattern be able to identify phrases like i.e.? Why does it fail to detect it?

Answer 1

First, the last . in (?<!\w\.\w.) looks suspicious. If you need it to match a literal dot, escape it: (?<!\w\.\w\.).
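As a rough illustration of why the unescaped dot matters, the sketch below compares the two lookbehinds on a made-up sentence. The sample text and the simplified patterns (only the first lookbehind plus a [.?] check) are assumptions chosen for the demonstration, not taken from the question.

```python
import re

# Hypothetical sample: "1.5?" is a word char, a dot, a word char, then "?".
text = "Is it 1.5? Yes."

# Unescaped final dot: "." matches any character, so "1.5?" satisfies
# \w\.\w. and the negative lookbehind blocks the split.
loose = re.split(r'(?<!\w\.\w.)(?<=[.?])\s', text)

# Escaped final dot: only a literal "." is accepted, so "1.5?" does not
# satisfy \w\.\w\. and the split goes ahead.
strict = re.split(r'(?<!\w\.\w\.)(?<=[.?])\s', text)

print(loose)
print(strict)
```

With the unescaped dot the text stays in one piece; with the escaped dot it is split after the question mark.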

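Coming back to the question: when you use the r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind is evaluated at the position after the whitespace. It checks that this position is not preceded by a word char, a dot, a word char, and any char (since the . is unescaped). This condition is true, because that position is preceded by a dot, e, another ., and a space, so the split still happens.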
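The check can be reproduced by hand. The following sketch uses a shortened, assumed sample string and inspects the four characters that the trailing lookbehind actually sees.

```python
import re

# Shortened sample (an assumption for illustration).
text = "dollars, i.e. he paid"

# Position right after the whitespace that follows "i.e."
pos = text.index(" he") + 1

# The four characters immediately before that position.
before = text[pos - 4:pos]
print(repr(before))

# \w\.\w. needs a word character first, but `before` starts with ".",
# so there is no match and the negative lookbehind succeeds.
print(re.fullmatch(r'\w\.\w.', before))

# As a result the reordered pattern still splits after "i.e."
parts = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', text)
print(parts)
```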

To make the lookbehind work the same way it did when it came before \s, put \s into the lookbehind pattern as well:

(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)

See the regex demo.
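A minimal sketch of the adjusted pattern applied to the text from the question follows. How new_str was built is not shown in the question, so collapsing all whitespace runs to a single space is an assumption made here.

```python
import re

s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""

# Assumption: collapse every run of whitespace (including "\n") to one space.
new_str = " ".join(s.split())

# The trailing lookbehind now includes \s, so it inspects the same
# characters it would have inspected when placed before \s.
pattern = r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)'
sentences = re.split(pattern, new_str)

for sentence in sentences:
    print(sentence)
```

On this input the whitespace after "i.e." is no longer a split point, and the text comes out as the five expected sentences.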

Another enhancement is to use a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
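A quick way to confirm that the two spellings behave the same is to run both on a small sample. The sample sentence below is made up for this check.

```python
import re

# Hypothetical sample containing both terminators.
sample = "Did he mind? He did not. Fine."

# Alternation of two single-character branches.
with_alternation = re.split(r'(?<=\.|\?)\s', sample)

# Equivalent character class; inside [...] the dot and "?" are literal.
with_class = re.split(r'(?<=[.?])\s', sample)

print(with_alternation)
print(with_class == with_alternation)
```

The character class is shorter and needs no escaping, while producing the same split points.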


python

regex

nlp

negative-lookbehind

tokenize