Remove all words from a character vector that are NOT certain words
I have a character vector like this:
[70] "CSF 5896-6133"
[71] "CRT 16"
[72] "SEEF 54-55"
[73] "CIF 190-195"
[74] "DE & /ON CIF 196-222"
[75] " CRT 17 "
[76] " SEEF 56-57"
[77] "DE & /ON CSF 6134-6725 "
[78] " SEEF 58-60"
[79] "CRT 18"
[80] " CSF 6726-6837"
[81] "SEEF 61"
[82] " CSF 6840-6926"
[83] " CIF 223-226"
[84] "SEEF 62-63"
[85] " CSF 6927-7065"
[86] " CIF 226-228"
[87] "CSF 7066-7185"
[88] "CSF 7186-7311"
[89] " CIF 229"
[90] " SEEF 66"
[91] "CSF 7312-7561"
[92] " CRT 19"
[93] " SEEF 67-68"
[94] "Final data QAQC done on CSF 1-7561"
[95] " CIF 1-229"
[96] " SEEF 1-68 "
[97] " CRT 1-19"
[98] "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area"
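As you can see, this is only part of it.
I want to remove every word that is not a number or one of the keywords CSF, CIF, SEEF, CRT.
So, for example, elements 94-98 would end up looking like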
[94] "CSF 1-7561"
[95] " CIF 1-229"
[96] " SEEF 1-68 "
[97] " CRT 1-19"
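As you can see, element 98 would be removed entirely because it contains none of the keywords I want. Element 94 also had some words removed.
Consider the following vector: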
v <- c("Final data QAQC done on CSF 1-7561",
"CIF 1-229",
"SEEF 1-68",
"CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
You could do:
## vector with words to match
cond <- c("CSF", "CIF", "SEEF", "CRT")
## regex that captures digits and tolerates dashes (-)
reg <- "(\\d+-?)+$"
## pattern to match either words or regex
pattern <- paste(c(cond, reg), collapse = "|")
Then use stri_extract_all_regex() from the stringi package:
library(stringi)
stri_extract_all_regex(v, pattern)
which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#[1] NA
As @akrun mentioned, you could also do this in base R:
regmatches(v, gregexpr(pattern, v))
which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#character(0)
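The question asks for a cleaned character vector rather than a list, so the result above still needs one more step. The following is a minimal base R sketch of that step, assuming the same `v` and `pattern` as above; the names `matches` and `cleaned` are introduced here only for illustration.

```r
v <- c("Final data QAQC done on CSF 1-7561",
       "CIF 1-229",
       "SEEF 1-68",
       "CRT 1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

## same pattern as above: keywords, or a trailing run of digits and dashes
pattern <- paste(c("CSF", "CIF", "SEEF", "CRT", "(\\d+-?)+$"), collapse = "|")

## one list element per input string, character(0) where nothing matched
matches <- regmatches(v, gregexpr(pattern, v))

## drop the elements without any match (element 98 in the question)
matches <- matches[lengths(matches) > 0]

## glue the pieces of each element back into a single string
cleaned <- vapply(matches, paste, character(1), collapse = " ")
cleaned
```

Because the numeric part of the pattern is anchored with `$`, this only picks up numbers at the very end of a string; inputs with trailing whitespace would need `trimws()` first.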
Using stringr:
library(stringr)
testString <- c("Final data QAQC done on CSF 1-7561" ,
" CIF 1-229" ,
" SEEF 1-68 ",
" CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area" )
str_extract(testString, "(CSF|CIF|SEEF|CRT)\\s+\\d+-\\d+")
[1] "CSF 1-7561" "CIF 1-229" "SEEF 1-68" "CRT 1-19" NA
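One caveat: the pattern above requires a range (`\\d+-\\d+`), so entries from the question that carry a single number, such as "CRT 16", would come back as NA. A possible variant, sketched here under the assumption that a single number should also be kept, makes the second half of the range optional. The vector `y` and the names `strict` and `relaxed` are made up for this illustration.

```r
library(stringr)

## a few entries taken from the question, including single-number ones
y <- c("CRT 16", "SEEF 54-55", " CRT 17 ", "DE & /ON CIF 196-222",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

## range required: single numbers are not matched
strict <- str_extract(y, "(CSF|CIF|SEEF|CRT)\\s+\\d+-\\d+")

## "-<digits>" made optional: single numbers are matched too
relaxed <- str_extract(y, "(CSF|CIF|SEEF|CRT)\\s+\\d+(-\\d+)?")

strict
relaxed
```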
I would use the stringr package.
Here is a subset of your data.
x <- c("CSF 5896-6133",
"CRT 16",
"SEEF 54-55",
"CIF 190-195",
"Final data QAQC done on CSF 1-7561",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area"
)
You can use str_extract with a regular expression that matches your pattern.
library(stringr)
> str_extract(x, '(CSF|CIF|SEEF|CRT)[:space:]+([0-9]|-)+')
[1] "CSF 5896-6133" "CRT 16" "SEEF 54-55" "CIF 190-195" "CSF 1-7561"
[6] NA
When an element has no match for the pattern, str_extract returns a missing value (NA) for it.
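To get rid of the unmatched elements entirely, as the question asks for element 98, the NA entries can be filtered out afterwards. This is a small sketch using the same `x`; the names `res` and `kept` are only illustrative, and the character class `[0-9-]+` is used as an equivalent spelling of `([0-9]|-)+`.

```r
library(stringr)

x <- c("CSF 5896-6133",
       "CRT 16",
       "SEEF 54-55",
       "CIF 190-195",
       "Final data QAQC done on CSF 1-7561",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

## keyword, whitespace, then a run of digits and dashes
res <- str_extract(x, "(CSF|CIF|SEEF|CRT)[[:space:]]+[0-9-]+")

## keep only the elements where something matched
kept <- res[!is.na(res)]
kept
```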