stringr str_replace_all misses repeated terms
I'm having trouble with the stringr::str_replace_all function. I'm trying to replace every instance of iv with insuredvehicle, but the function only seems to catch the first one.
library(data.table)
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = ' iv ', replacement = ' insuredvehicle ', string = text)]
The result looks like this, missing the second iv:
1: the driver of the 1st vehicle hit the insuredvehicle iv at a stop
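I think the problem is that the two instances share a space, which is part of the search pattern. I included the spaces because I want to replace the standalone term iv, not the iv inside driver.
I don't want to simply collapse the repeats into one. I would like the result to look like this: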
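The diagnosis above can be checked with a small sketch. It assumes stringr is installed and uses a plain character vector named text rather than the data.table column. Regex matches do not overlap, so the first ' iv ' match consumes the space that the second iv would need in front of it. A lookahead tests for the trailing space without consuming it.

```r
library(stringr)

text <- "the driver of the 1st vehicle hit the iv iv at a stop"

# Matches are non-overlapping: the first ' iv ' consumes the space that the
# second 'iv' would need as its leading space, so only one match is found.
n_with_spaces <- str_count(text, " iv ")

# A lookahead checks for the trailing space without consuming it,
# so the second 'iv' can still match its own leading space.
n_with_lookahead <- str_count(text, " iv(?= )")

print(c(with_spaces = n_with_spaces, with_lookahead = n_with_lookahead))
```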
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
Any help getting this to work would be greatly appreciated!
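Perhaps use word boundaries in the regex instead of putting the spaces in the pattern and the replacement? A word boundary is ideal when you want to match a whole word rather than part of a word, and it avoids these whitespace issues.
\b (written as \\b inside an R string) seems to do the trick:
temp_data[, new_text := stringr::str_replace_all(pattern = '\\biv\\b', replacement = 'insuredvehicle', string = text)]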
new_text
1: the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop
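One detail worth spelling out is how the word boundary is written in R. In an ordinary R string literal, "\b" is a single backspace character, so the regex engine only sees a word boundary when the backslash is doubled or a raw string is used. The following is a minimal sketch, assuming stringr is installed and R is version 4.0 or later (needed for the raw-string syntax); the variable names are illustrative.

```r
library(stringr)

text <- "the driver of the 1st vehicle hit the iv iv at a stop"

# "\b" is one character (backspace); "\\b" is two characters (backslash, b).
len_single <- nchar("\b")
len_double <- nchar("\\b")

# Doubled backslashes pass the regex token \b to the engine.
res_escaped <- str_replace_all(text, "\\biv\\b", "insuredvehicle")

# A raw string (R >= 4.0) avoids the doubling.
res_raw <- str_replace_all(text, r"(\biv\b)", "insuredvehicle")

print(res_escaped)
```

Both forms leave the iv inside driver alone, because there is no word boundary between r and i.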
You can use lookarounds:
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<= )iv(?= )', replacement = 'insuredvehicle', string = text)]
Output:
"the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Using gsub:
gsub("\\biv\\b", "insuredvehicle", temp_data$text)
[1] "the driver of the 1st vehicle hit the insuredvehicle insuredvehicle at a stop"
Using whitespace boundaries:
temp_data <- data.table(text = 'the driver of the 1st vehicle hit the iv iv at a stop')
temp_data[, new_text := stringr::str_replace_all(pattern = '(?<!\\S)iv(?!\\S)', replacement = 'insuredvehicle', string = text)]
See the regex proof.
Explanation
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
iv 'iv'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead
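To see how the whitespace boundaries differ from the other patterns in this thread, here is a small sketch, assuming stringr is installed. The sample string with punctuation is hypothetical and not from the question. It suggests that \b accepts punctuation as a delimiter, that (?<!\S)iv(?!\S) accepts only whitespace or a string edge, and that the positive lookarounds (?<= )iv(?= ) need a literal space on both sides.

```r
library(stringr)

# Hypothetical sample with punctuation next to the term (not from the question).
sample_text <- "the iv hit iv-2 and iv."

# \b treats '-' and '.' as non-word characters, so all three 'iv' match.
n_word_boundary <- str_count(sample_text, "\\biv\\b")

# (?<!\S) and (?!\S) require whitespace or a string edge on both sides,
# so only the first, space-delimited 'iv' matches.
n_space_boundary <- str_count(sample_text, "(?<!\\S)iv(?!\\S)")

# Unlike (?<= )iv(?= ), the negative lookarounds also match at string edges.
n_edge_negative <- str_count("iv", "(?<!\\S)iv(?!\\S)")
n_edge_positive <- str_count("iv", "(?<= )iv(?= )")

print(c(n_word_boundary, n_space_boundary, n_edge_negative, n_edge_positive))
```

Which pattern fits depends on whether terms such as iv-2 or a sentence-final iv. should be replaced in the real data.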