Regex with exception for words

Question

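Hi everyone, I am trying to split text into sentences. To do that, I use a regex that splits wherever a period is followed by a space.

The text contains "Mr.", "U.S." and "U.K.", and I do not want the text to be split at these abbreviations.

The regex I am using can be seen in the attached image.

This does the job, but it creates a problem.

In the sentence "My name is Mr. Aishwarya.", the regex selects "a." instead of just the period. How can I change it so that it selects only the "."?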

Answer 1

You can use

\b(?:Mr|U\.S|U\.K)\.(*SKIP)(*F)|\.\s+
# A shrunk version:
\b(?:Mr|U\.[SK])\.(*SKIP)(*F)|\.\s+

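See the regex demo. Details:

\b(?:Mr|U\.S|U\.K)\.(*SKIP)(*F) - matches Mr., U.S. or U.K. as whole words, then discards that match and resumes the search from the position where the failure occurred
| - or
\.\s+ - a ., then one or more whitespace characters.

See an R demo: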

x <- "My name is Mr. Aishwarya. I live in the U.K. and want to go to U.S. Us will go to Australi. Who else wants to go to U.S.A. ? My wife's name is Ruchika Bhatt. "
# Backslashes must be doubled inside R string literals
strsplit(x, "\\b(?:Mr|U\\.S|U\\.K)\\.(*SKIP)(*F)|\\.\\s+", perl=TRUE)

Output:

[[1]]
[1] "My name is Mr. Aishwarya"                                        
[2] "I live in the U.K. and want to go to U.S. Us will go to Australi"
[3] "Who else wants to go to U.S.A"                                   
[4] "? My wife's name is Ruchika Bhatt"
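
As a quick check of the "shrunk version" mentioned above, the following sketch splits the same string with both patterns and compares the results. The variable names (`full`, `short`, `res_full`, `res_short`) are made up for illustration and are not part of the original answer.

```r
x <- "My name is Mr. Aishwarya. I live in the U.K. and want to go to U.S. Us will go to Australi. Who else wants to go to U.S.A. ? My wife's name is Ruchika Bhatt. "

# Full alternation and the shortened character-class variant
full  <- "\\b(?:Mr|U\\.S|U\\.K)\\.(*SKIP)(*F)|\\.\\s+"
short <- "\\b(?:Mr|U\\.[SK])\\.(*SKIP)(*F)|\\.\\s+"

res_full  <- strsplit(x, full, perl = TRUE)[[1]]
res_short <- strsplit(x, short, perl = TRUE)[[1]]

# TRUE when both patterns split the text at the same places
print(identical(res_full, res_short))
```

Both patterns need perl=TRUE, since the (*SKIP)(*F) verbs are a PCRE feature.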

Answer 2

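The pattern you tried uses a negated character class, [^U.S|U.K.|Mr.]: because it starts with ^, it matches any single character except the ones listed.

It could also be written as [^.US|KMr](\.\s), and it matches "a. " because a is not listed in the character class.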
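
A minimal sketch of that behaviour, assuming the pattern in the question's image was the negated class followed by (\.\s). The sample string and the names `m1` and `m2` are illustrative only.

```r
x <- "My name is Mr. Aishwarya. I live in the U.K. now."

# Negated class as quoted from the question, then a dot and a whitespace character
m1 <- regmatches(x, gregexpr("[^U.S|U.K.|Mr.](\\.\\s)", x, perl = TRUE))[[1]]

# The same class with the duplicate characters removed
m2 <- regmatches(x, gregexpr("[^.US|KMr](\\.\\s)", x, perl = TRUE))[[1]]

print(m1)                 # the only match includes the letter before the dot
print(identical(m1, m2))  # both spellings of the class behave the same
```

The class consumes one character, so the letter before the period becomes part of the match and is lost when splitting.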

What you want is a grouping mechanism, using parentheses () with a pipe | to separate the alternatives.
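
For contrast, here is a small sketch of a group with alternation that matches the abbreviations as whole units rather than as single characters. The sample string and the name `abbrevs` are assumptions made for illustration.

```r
x <- "My name is Mr. Aishwarya. I live in the U.K. and want to go to U.S. soon."

# A non-capturing group with | lists whole alternatives, unlike a character class
abbrevs <- regmatches(x, gregexpr("\\b(?:Mr|U\\.S|U\\.K)\\.", x, perl = TRUE))[[1]]

print(abbrevs)
```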

Another option is to use a negative lookbehind before matching the dot and a whitespace character.

(?<!\bU\.[SK]|\bMr)\.\s

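The pattern matches:

(?<! Negative lookbehind, assert that what is directly to the left is not
- \bU\.[SK] Match U.S or U.K
- | Or
- \bMr Match Mr
) Close the lookbehind
\.\s Match a dot and a whitespace character (or use \h to match only horizontal whitespace such as a space or tab, so that newlines are not matched)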

Regex demo

In R, set perl=TRUE to use Perl-compatible regular expressions:

strsplit(
    "My name is Mr. Aishwarya. I live in the U.K. and want to go to U.S. Us will go to Australi. Who else wants to go to U.S.A. ? My wife's name is Ruchika Bhatt. ",
    # Backslashes must be doubled inside R string literals
    "(?<!\\bU\\.[SK]|\\bMr)\\.\\s",
    perl=TRUE
)

Output:

[[1]]
[1] "My name is Mr. Aishwarya"                                        
[2] "I live in the U.K. and want to go to U.S. Us will go to Australi"
[3] "Who else wants to go to U.S.A"                                   
[4] "? My wife's name is Ruchika Bhatt"
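
The difference between \s and \h mentioned above only shows up when the text contains line breaks. A small sketch, using a made-up two-line string and illustrative names `by_s` and `by_h`:

```r
x <- "First sentence.\nSecond sentence. Third sentence."

# \s also matches the newline, so the text is split at ".\n" as well
by_s <- strsplit(x, "(?<!\\bU\\.[SK]|\\bMr)\\.\\s", perl = TRUE)[[1]]

# \h matches only horizontal whitespace, so the dot before the newline is kept
by_h <- strsplit(x, "(?<!\\bU\\.[SK]|\\bMr)\\.\\h", perl = TRUE)[[1]]

print(by_s)
print(by_h)
```

Which variant fits depends on whether a line break after a period should count as a sentence boundary in your data.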


regex

r

regex-negation