Case-sensitive hyphenation replacement with regex

Question

I'm trying to clean up some German-language text in R.

library(tidyverse)
bye_bye_hyphenation <- function(x){
  # removes words separated by hyphenation f.e. due to PDF input
  # eliminate line breaks
  # first group for characters (incl. European ones) (\1), dash and following whitespace,
  # second group for characters (\2) (incl. European ones)
  stringr::str_replace_all(x, "([a-z|A-Z\\x7f-\\xff]{1,})\\-[\\s]{1,}([a-z|A-Z\\x7f-\\xff]{1,})", "\\1\\2")
}

# this works correctly
"Ex-\n ample" %>% 
  bye_bye_hyphenation()
#> [1] "Example"

# this should stay the same, `Regierungsund` should not be
# concatenated
"Regierungs- und Verwaltungsgesetz" %>%
  bye_bye_hyphenation()
#> [1] "Regierungsund Verwaltungsgesetz"

Created on 2019-06-19 by the reprex package (v0.3.0)

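Does anybody know how to make the whole regex case-sensitive, so that it does not trigger in the second case, i.e. whenever the word und appears after a dash and whitespace?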

Answer 1

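Maybe you could use a negative or positive lookahead (see Regex lookahead, lookbehind and atomic groups). The regex below removes a dash, plus an optional newline and space after it, if it is not followed by the word "und"; otherwise it removes only the newline: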

library(stringr)

string1 <- "Ex- ample"
string2 <- "Ex-\n ample"
string3 <- "Regierungs- und Verwaltungsgesetz"
string4 <- "Regierungs-\n und Verwaltungsgesetz"

pattern <- "(-\\n?\\s?(?!\\n?\\s?und))|(\\n(?=\\s?und))"

str_remove(string1, pattern)
#> [1] "Example"
str_remove(string2, pattern)
#> [1] "Example"
str_remove(string3, pattern)
#> [1] "Regierungs- und Verwaltungsgesetz"
str_remove(string4, pattern)
#> [1] "Regierungs- und Verwaltungsgesetz"

Created on 2019-06-19 by the reprex package (v0.3.0)
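Since str_remove() only drops the first match in each string, a longer text with several hyphenated line breaks probably needs str_remove_all() instead. The following is a minimal sketch under that assumption; the helper name remove_hyphenation is hypothetical and not part of the original answer, and the pattern is the same one used above.

```r
library(stringr)

# Hypothetical helper (not from the original answer): applies the
# lookahead pattern to every match in every element of a character vector.
remove_hyphenation <- function(x) {
  # First alternative: a dash plus optional newline/space, unless "und" follows.
  # Second alternative: a bare newline that is followed by "und".
  pattern <- "(-\\n?\\s?(?!\\n?\\s?und))|(\\n(?=\\s?und))"
  str_remove_all(x, pattern)
}

cleaned <- remove_hyphenation(c(
  "Ex-\n ample",
  "Regierungs- und Verwaltungsgesetz",
  "Regierungs-\n und Verwaltungsgesetz",
  "Zei-\n tung und Ver-\n lag"
))
```

The first three inputs should come out as in the answer above, and the last one should have both breaks joined. One caveat: because the newline and the space are both optional in the pattern, it also matches a dash inside an ordinary compound such as "E-Mail", so it may need tightening (for example by requiring at least one whitespace character after the dash) before being run on real text.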
