Lookaround pattern occasionally doesn't work

Question

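I am using regular expressions in R to process an echocardiography dataset. I want to detect cases where a phenomenon called "SAM" occurs, and obviously I want to exclude cases such as "no SAM".

So I wrote this: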

library(tidyverse)  # provides stringr (regex, str_view_all) and tibble

pattern_sam <- regex("(?<!no )sam", ignore_case = TRUE)
str_view_all(echo_1_lvot$description_echo, pattern_sam, match = TRUE)

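It eliminates 99.9% of the "no SAM" cases, but for some reason I still get 3 "no SAM" cases (see the image below).

The strange thing is that if I simply copy and paste those strings into a new dataset, the problem disappears...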

sam_test <- tibble(description_echo = c(
  "There is asymmetric septal hypertrophy severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compatible with type III HCM",
  "-Normal LV size with mild to moderate systolic dysfunction,EF=45%,severe LVH in all myocardial segements with spared basal posterior wall with asymmetric septal hypertrophy(anteroseptal thickness=2cm,PWD=0.94cm),no SAM nor LVOT obstruction which is compa"
))

str_view_all(sam_test$description_echo, pattern_sam)

The same thing happens when I try to detect other patterns.

Does anyone know what the underlying problem is and how to fix it?

P.S.: here is the .xls file (I only included the problematic strings), if you want to see for yourself.

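Interestingly, when I manually delete "No SAM" from the .xls and retype it in exactly the same place, the problem goes away. I still don't know what is wrong; could it be a text-formatting issue?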

Answer 1

You can use \s to match any whitespace, including Unicode whitespace, because you are using the ICU regex flavor (it is the one used by all stringr/stringi regex functions):

pattern_sam <- regex("(?<!no\\s)sam", ignore_case = TRUE)
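
To make the difference concrete, here is a minimal sketch. It assumes the stray character in the spreadsheet is a no-break space (U+00A0); that is a guess at the likely culprit and is not confirmed by the question. The sample strings and variable names are invented for illustration.

```r
library(stringr)

# Hypothetical sample strings: the first has a no-break space (U+00A0)
# between "no" and "SAM", the second an ordinary space (U+0020).
nbsp_text  <- "no\u00a0SAM nor LVOT obstruction"
plain_text <- "no SAM nor LVOT obstruction"

# Inspect the code points of the first three characters of each string.
# The third code point differs: 160 for the no-break space, 32 for a space.
nbsp_codes  <- utf8ToInt(substr(nbsp_text, 1, 3))
plain_codes <- utf8ToInt(substr(plain_text, 1, 3))

# Original pattern: the lookbehind only rejects a literal U+0020 space.
literal_space <- regex("(?<!no )sam", ignore_case = TRUE)
# Revised pattern: \s in ICU also covers Unicode space separators.
any_space     <- regex("(?<!no\\s)sam", ignore_case = TRUE)

literal_hits <- str_detect(c(nbsp_text, plain_text), literal_space)
class_hits   <- str_detect(c(nbsp_text, plain_text), any_space)
```

Under that assumption, `literal_hits` flags the first string as a false positive, while `class_hits` is FALSE for both. Running `utf8ToInt()` on one of the problematic cells from the real data would show which character is actually there.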

To match any non-word character, including some non-printable characters, use

regex("(?<!no\\W)sam", ignore_case = TRUE)

Also, if there can be more than one such character, you can use a bounded-width lookbehind (available in ICU and Java):

pattern_sam <- regex("(?<!no\\s{1,10})sam", ignore_case = TRUE)
pattern_sam <- regex("(?<!no\\W{1,10})sam", ignore_case = TRUE)

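Here, 1 to 10 whitespace (or non-word) characters are allowed between "no" and "sam".

If you need whole-word matching, add \b, a word boundary: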

pattern_sam <- regex("(?<!\\bno\\s{1,10})sam\\b", ignore_case = TRUE)
pattern_sam <- regex("(?<!\\bno\\W{1,10})sam\\b", ignore_case = TRUE)
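
A quick check of the word-boundary variant may help show what it accepts and rejects. This is only a sketch: the report fragments below are made up for illustration and do not come from the original dataset.

```r
library(stringr)

# Bounded lookbehind plus word boundaries, same pattern as above.
pattern_sam <- regex("(?<!\\bno\\s{1,10})sam\\b", ignore_case = TRUE)

# Hypothetical report fragments, invented for this example.
reports <- c(
  "SAM of the mitral valve is present",  # genuine mention
  "no \u00a0SAM nor LVOT obstruction",   # space plus no-break space after "no"
  "findings are the same as before"      # "sam" embedded in a longer word
)

# One logical value per report: does it contain an un-negated "SAM"?
sam_flags <- str_detect(reports, pattern_sam)
```

With these inputs only the first fragment should be flagged: the second is rejected by the lookbehind even though two different whitespace characters follow "no", and the third is rejected because "sam" is followed by a word character.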

regex

r

regex-lookarounds

tidyverse