Start with a certain phrase up to a certain phrase
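I have some texts, some of which follow a predefined template that adds no value to the analysis.
I would like to use regex to systematically remove the template (which typically consists of header text with greetings, and closing text with a thank you) so that I can focus on the variable text.
Both the header and the closing may contain variable text, e.g. a variable location or a variable staff name. So text 1 may have the location equal to ABC and the staff name equal to Sofia.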
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
want <- "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n"
header <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:"
tail <- "\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
My current attempt is below.
# remove everything before 'menu'
gsub('(.*)menu:','', have)
# want to correct the above to
# remove everything that
# starts with "Hello, thank you for contacting" up to "Please find our available menu"
# remove everything after Sincerely, inclusive
gsub('Sincerely.*','', have)
# want to correct the above to
# remove everything that
# starts with "Sincerely,\nThe Awesome Pizza Team" up to "\nDelivering Pizza 24/7"
Second attempt
# text
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
# remove any text in between 'Hello, thank you for contacting`
# up to 'Please find below our available menu:'
# and also the anchoring texts
(want <- gsub(pattern = '(Hello, thank you for contacting).*(Please find below our available menu:)',''
, x = have))
# remove any text after `\n\n Sincerely,\nThe Awesome Pizza Team\n`, inclusive the text itself
(want <- gsub(pattern = '\n\n Sincerely,\nThe Awesome Pizza Team\n.*',''
, x = want))
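One option could be to match all lines before the menu, then capture all consecutive lines that start with Menu, and match the remaining lines starting from Sincerely.
In the replacement, use capture group 1.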
^[\s\S]*?\R((?:Menu .*\R+)*)\s*Sincerely,[\s\S]*
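The pattern matches:
^ : start of string
[\s\S]*?\R : match any character, as few times as possible, followed by a newline sequence
( : open capture group 1
(?:Menu .*\R+)* : repeatedly match a line that starts with Menu, together with its trailing newline(s)
) : close group 1
\s* : match optional whitespace characters
Sincerely, : match literally
[\s\S]* : match the rest of the text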
Example
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
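# backslashes must be doubled inside an R string literal; \R and the lazy quantifier need perl = TRUE
trimws(gsub('^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*', '\\1', have, perl = TRUE))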
Output
[1] "Menu 1 USD 1.99\nMenu 2 USD 3.99"
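Since the question mentions several texts with different locations and staff names, the same replacement could be wrapped in a small function and applied to a character vector. This is only a sketch: `strip_template` is a hypothetical helper name, and the second message (location XYZ, staff name Liam, Menu 3) is made-up sample data, not from the original post.

```r
# Hypothetical helper: keep only the consecutive "Menu ..." lines of each message.
strip_template <- function(x) {
  # Same pattern as above, with backslashes doubled for the R string literal.
  pattern <- "^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*"
  # gsub() is vectorised over x; trimws() drops the newlines left at the end of group 1.
  trimws(gsub(pattern, "\\1", x, perl = TRUE))
}

# Sample inputs: the first uses the values from the question (ABC, Sofia),
# the second is invented for illustration.
texts <- c(
  "Hello, thank you for contacting our Pizza Store, ABC. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\nSofia\nDelivering Pizza 24/7",
  "Hello, thank you for contacting our Pizza Store, XYZ. \n\r Please find below our available menu:\nMenu 3 USD 5.49\n\n\n Sincerely,\nThe Awesome Pizza Team\nLiam\nDelivering Pizza 24/7"
)

result <- strip_template(texts)
print(result)
```

A text that does not fit the template is not matched, so `gsub()` returns it unchanged apart from the `trimws()` call.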
A slightly longer, more precise pattern could be:
^(?:(?!Menu ).*(?:\R(?!Menu ).*)*\R+)?(Menu .*(?:\RMenu .*)*)\R\s*Sincerely,[\s\S]*
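For reference, a minimal sketch of how this longer pattern might be used in R, assuming the same `have` string as in the question. The variable names `precise` and `result` are illustrative only. As before, backslashes are doubled in the string literal and `perl = TRUE` is needed for `\R` and the lookaheads.

```r
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"

# The longer pattern, escaped for an R string literal.
precise <- "^(?:(?!Menu ).*(?:\\R(?!Menu ).*)*\\R+)?(Menu .*(?:\\RMenu .*)*)\\R\\s*Sincerely,[\\s\\S]*"

# Group 1 ends at the last Menu line, so no trimws() is needed here.
result <- gsub(precise, "\\1", have, perl = TRUE)
print(result)
```

Here the trailing newlines are matched outside group 1, which is why the result comes back without surrounding whitespace.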