Remove text that starts with a certain phrase and runs up to another phrase

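I have a set of texts, some of which follow a predefined template that adds no value to the analysis.

I want to use regex to systematically remove the template (usually a header text such as greetings and a closing text such as "thank you") so that I can focus on the variable text.

Both the header and the closing may themselves contain variable text, e.g. <variable location> and <variable staff name>. So text 1 might have location equal to ABC and staff name equal to Sofia.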

have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

want <- "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n"


header <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:"

tail <- "\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"


My current attempts are below.

# remove everything before 'menu'
gsub('(.*)menu:','', have)
# want to correct the above to
# remove everything that 
# starts with "Hello, thank you for contacting" up to "Please find our available menu"

# remove everything after Sincerely, inclusive
gsub('Sincerely.*','', have)
# want to correct the above to
# remove everything that 
# starts with "Sincerely,\nThe Awesome Pizza Team" up to "\nDelivering Pizza 24/7"

Second attempt

# text
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

# remove any text in between 'Hello, thank you for contacting` 
# up to 'Please find below our available menu:'
# and also the anchoring texts
(want <- gsub(pattern = '(Hello, thank you for contacting).*(Please find below our available menu:)',''
     , x = have))

# remove any text after '\n\n Sincerely,\nThe Awesome Pizza Team\n', including the text itself
(want <- gsub(pattern = '\n\n Sincerely,\nThe Awesome Pizza Team\n.*',''
             , x = want))

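One option is to match all lines before the menu, then capture all consecutive lines that start with Menu, and finally match the remaining lines starting from Sincerely.

In the replacement, use capture group 1.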

^[\s\S]*?\R((?:Menu .*\R+)*)\s*Sincerely,[\s\S]*

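The pattern matches:

  • ^ Start of string
  • [\s\S]*?\R Match any character, as few times as possible, followed by a newline
  • ( Capture group 1
    • (?:Menu .*\R+)* Repeatedly match each line that starts with Menu, together with the newline(s) after it
  • ) Close group 1
  • \s* Match optional whitespace characters
  • Sincerely, Match literally
  • [\s\S]* Match the rest of the text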


Example

have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
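# Backslashes must be doubled inside an R string literal; perl = TRUE enables \R and lazy quantifiers
trimws(gsub('^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*', '\\1', have, perl = TRUE))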

Output

[1] "Menu 1 USD 1.99\nMenu 2 USD 3.99"
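Since gsub() is vectorized, the same call should work on a whole vector of texts whose location and staff name differ. The sketch below assumes two made-up messages (location ABC with staff Sofia as mentioned in the question, plus a hypothetical XYZ/Liam message with a different menu); the names texts, pattern and menus are only for illustration.

```r
# Two hypothetical messages that share the template but differ in
# location, staff name and menu content.
texts <- c(
  "Hello, thank you for contacting our Pizza Store, ABC. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\nSofia\nDelivering Pizza 24/7",
  "Hello, thank you for contacting our Pizza Store, XYZ. \n\r Please find below our available menu:\nMenu 3 USD 5.49\n\n\n Sincerely,\nThe Awesome Pizza Team\nLiam\nDelivering Pizza 24/7"
)

# Same pattern as above; backslashes are doubled for the R string literal.
pattern <- "^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*"

# Keep only capture group 1 (the Menu lines) and trim the trailing newlines.
menus <- trimws(gsub(pattern, "\\1", texts, perl = TRUE))
print(menus)
```

Each element of menus should then hold only the Menu lines of the corresponding message.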

A slightly longer, more precise pattern could be:

 ^(?:(?!Menu ).*(?:\R(?!Menu ).*)*\R+)?(Menu .*(?:\RMenu .*)*)\R\s*Sincerely,[\s\S]*
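As a rough sketch of how this longer pattern might be used in R, the snippet below applies it to the sample text from the question. The variable names precise and result are only for illustration. Because group 1 here ends at the last Menu line rather than at the trailing newlines, trimws() should not be needed.

```r
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

# The longer pattern, with backslashes doubled for the R string literal.
# The negative lookahead (?!Menu ) skips every line that does not start with "Menu ".
precise <- "^(?:(?!Menu ).*(?:\\R(?!Menu ).*)*\\R+)?(Menu .*(?:\\RMenu .*)*)\\R\\s*Sincerely,[\\s\\S]*"

# Replace the whole match with capture group 1 (the consecutive Menu lines).
result <- gsub(precise, "\\1", have, perl = TRUE)
print(result)
```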
