Find and replace numbers in an R txt file
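I am trying to find all sentences in a text file in R that contain numbers of any format, and to surround each number with hashtags.
For example, take the following input: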
ex <- c("I have $5.78 in my account","Hello my name is blank","do you want 1,785 puppies?",
        "I love stack overflow!","My favorite numbers are 3, 14,568, and 78")
As the output of the function, I am looking for:
> "I have #$5.78# in my account"
> "do you want #1,785# puppies?"
> "My favorite numbers are #3#, #14,568#, and #78#"
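Surrounding the numbers is straightforward, assuming that anything made up of digits, periods, commas, and dollar signs counts as a number. Note that a leading dollar sign ends up outside the hashtags, because \\b only matches next to a word character such as a digit.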
gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex)
# [1] "I have $#5.78# in my account"
# [2] "Hello my name is blank"
# [3] "do you want #1,785# puppies?"
# [4] "I love stack overflow!"
# [5] "My favorite numbers are #3#, #14,568#, and #78#"
To keep only the entries that contain numbers:
grep("\\d", gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex), value = TRUE)
# [1] "I have $#5.78# in my account"
# [2] "do you want #1,785# puppies?"
# [3] "My favorite numbers are #3#, #14,568#, and #78#"
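The word-boundary pattern leaves a leading dollar sign outside the hashtags. If the dollar sign should sit inside them instead, one possible variant is sketched below. It is an assumption, not part of the original answer: the helper name tag_numbers is hypothetical, and the pattern only covers digit groups joined by periods or commas, with an optional "$" in front.

```r
# Hypothetical helper: wrap each number, including an optional leading "$",
# in hashtags.
tag_numbers <- function(x) {
  # \\$?               optional literal dollar sign
  # [0-9]+             one or more digits
  # (?:[.,][0-9]+)*    further digit groups separated by "." or ","
  gsub("(\\$?[0-9]+(?:[.,][0-9]+)*)", "#\\1#", x, perl = TRUE)
}

ex <- c("I have $5.78 in my account", "Hello my name is blank",
        "do you want 1,785 puppies?", "I love stack overflow!",
        "My favorite numbers are 3, 14,568, and 78")

tagged <- tag_numbers(ex)

# Keep only the sentences that actually contain a digit.
tagged[grepl("[0-9]", ex)]
```

Because the pattern requires each separator to be followed by a digit, trailing commas such as the one after "3," stay outside the hashtags.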
We can use gsub with lookarounds (which requires perl = TRUE):
gsub("(?<=\\s)(?=[$0-9])|(?<=[0-9])(?=,?[ ]|$)", "#", ex, perl = TRUE)
#[1] "I have #$5.78# in my account" "Hello my name is blank"
#[3] "do you want #1,785# puppies?" "I love stack overflow!"
#[5] "My favorite numbers are #3#, #14,568#, and #78#"
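The call above returns all five elements, including the sentences without numbers. A minimal sketch, assuming the goal is to keep only the sentences that contain a digit, combines the same lookaround pattern with grepl. The variable names pattern, tagged, and has_number are illustrative only.

```r
ex <- c("I have $5.78 in my account", "Hello my name is blank",
        "do you want 1,785 puppies?", "I love stack overflow!",
        "My favorite numbers are 3, 14,568, and 78")

# Two zero-width alternatives, each replaced by "#":
#   (?<=\\s)(?=[$0-9])       just after whitespace, just before "$" or a digit
#   (?<=[0-9])(?=,?[ ]|$)    just after a digit, just before an optional comma
#                            followed by a space, or before the end of the string
pattern <- "(?<=\\s)(?=[$0-9])|(?<=[0-9])(?=,?[ ]|$)"

tagged <- gsub(pattern, "#", ex, perl = TRUE)

# Logical index of the sentences that contain at least one digit.
has_number <- grepl("[0-9]", ex)
tagged[has_number]
```

Since the opening hashtag relies on preceding whitespace, a number at the very start of a string would not receive one with this pattern.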
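Another, step-by-step approach is to use grep to identify the elements of the text file that contain the pattern "[0-9]", subset those elements with ex[....], and use the pipe operator %>% from library(dplyr) to pass the subset to gsub, which then puts hashtags around the numeric entries using @r2evans's logic, as shown below: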
library(dplyr)
ex[do.call(grep,list("[0-9]",ex))] %>% gsub("\\b([-$0-9.,]+)\\b", "#\\1#",.)
The do.call(grep,list("[0-9]",ex)) portion of the code returns the indices of the text elements in ex that contain numeric entries; it is equivalent to calling grep("[0-9]", ex) directly.
Output
library(dplyr)
ex[do.call(grep,list("[0-9]",ex))] %>% gsub("\\b([-$0-9.,]+)\\b", "#\\1#",.)
[1] "I have $#5.78# in my account" "do you want #1,785# puppies?"
[3] "My favorite numbers are #3#, #14,568#, and #78#"
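Since the question starts from a text file, the same steps can be sketched end to end in base R, without dplyr. This is only an illustration under assumptions: the file is a temporary one created for the example, and the names path, sentences, with_numbers, and tagged are made up. grep(..., value = TRUE) returns the matching elements directly, so the do.call step is not needed.

```r
# Create a small example file (a stand-in for the real txt file).
path <- tempfile(fileext = ".txt")
writeLines(c("I have $5.78 in my account", "Hello my name is blank",
             "do you want 1,785 puppies?", "I love stack overflow!",
             "My favorite numbers are 3, 14,568, and 78"), path)

# Read the file, one sentence per line.
sentences <- readLines(path)

# Keep only the lines that contain a digit.
with_numbers <- grep("[0-9]", sentences, value = TRUE)

# Put hashtags around each run of digits, periods, commas, and dollar signs
# that is delimited by word boundaries (the pattern from the first answer).
tagged <- gsub("\\b([-$0-9.,]+)\\b", "#\\1#", with_numbers)
tagged
```

The result could be written back out with writeLines(tagged, "some_output.txt"), where the output file name is again just a placeholder.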