Locate position of first number in string [R]

How can I create a function in R that locates the word position of the first number in a string?

For example:

string1 <- "Hello I'd like to extract where the first 1010 is in this string"
#desired_output for string1
9

string2 <- "80111 is in this string"
#desired_output for string2
1

string3 <- "extract where the first 97865 is in this string"
#desired_output for string3
5

Here is a way to return your desired output:

library(stringr)
min(which(!is.na(suppressWarnings(as.numeric(str_split(string, " ", simplify = TRUE))))))

Here is how it works:

str_split(string, " ", simplify = TRUE) # converts your string to a vector/matrix, splitting at space

as.numeric(...) # tries to convert each element to a number, returning NA when it fails

suppressWarnings(...) # suppresses the warnings generated by as.numeric

!is.na(...) # returns true for the values that are not NA (i.e. the numbers)

which(...) # returns the positions of the TRUE values

min(...) # returns the first position

Output:

min(which(!is.na(suppressWarnings(as.numeric(str_split(string1, " ", simplify = TRUE))))))
[1] 9
min(which(!is.na(suppressWarnings(as.numeric(str_split(string2, " ", simplify = TRUE))))))
[1] 1
min(which(!is.na(suppressWarnings(as.numeric(str_split(string3, " ", simplify = TRUE))))))
[1] 5
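
Since the question asks for a function, the one-liner above can be wrapped in one. This is only a minimal sketch: it uses base R's strsplit() instead of str_split(), and the name first_number_pos and the choice to return NA for a string without any number are assumptions of mine, not part of the answer. (min() on an empty result would instead give Inf with a warning.)

```r
# Hypothetical helper: word position of the first token that parses as a number.
first_number_pos <- function(string) {
  # Split on a literal space; [[1]] takes the words of the single input string
  words <- strsplit(string, " ", fixed = TRUE)[[1]]
  # TRUE for every word that as.numeric() can convert
  is_num <- !is.na(suppressWarnings(as.numeric(words)))
  # which() returns integer(0) when nothing matches; [1] turns that into NA
  which(is_num)[1]
}

first_number_pos("80111 is in this string")
first_number_pos("no digits here")
```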

Try the following:

library(stringr)

position_first_number <- function(string) {
  min(which(str_detect(str_split(string, "\\s+", simplify = TRUE), "[0-9]+")))
}

Using your example strings:

> string1 <- "Hello I'd like to extract where the first 1010 is in this string"
> position_first_number(string1)
[1] 9
 
> string2 <- "80111 is in this string"
> position_first_number(string2)
[1] 1
 
> string3 <- "extract where the first 97865 is in this string"
> position_first_number(string3)
[1] 5
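
One difference from the as.numeric() approach is worth keeping in mind: the "[0-9]+" pattern flags any word that contains a digit, whereas as.numeric() only accepts words that are numbers in their entirety. The sketch below illustrates this in base R with a made-up input string; anchoring the pattern with ^ and $ is one possible way to require all-digit words, not something the answer itself proposes.

```r
# Made-up example input, split into words
words <- strsplit("room B12 holds 40 people", " ", fixed = TRUE)[[1]]

# First word that contains a digit anywhere ("B12")
contains_digit <- which(grepl("[0-9]+", words))[1]

# First word made up of digits only ("40")
all_digits <- which(grepl("^[0-9]+$", words))[1]

contains_digit  # 2
all_digits      # 4
```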

I would simply use grep + strsplit here as a base R option:

sapply(input, function(x) grep("\\d+", strsplit(x, " ")[[1]]))

Hello I'd like to extract where the first 1010 is in this string
                                                               9
                                         80111 is in this string
                                                               1
                 extract where the first 97865 is in this string
                                                               5
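
Note that grep() returns every matching position, so a string with several numbers yields several values and a string with none yields an empty result. If exactly one value per string is wanted, something along these lines may help. It is a sketch with made-up inputs, and vapply() with [1] is my own substitution for the sapply() call above.

```r
# Made-up inputs: two numbers, no number, one number
input <- c("first 10 then 20", "no digits here", "80111 is in this string")

# grep() returns all matching positions; [1] keeps the first (NA if none),
# and vapply() guarantees exactly one integer per input string.
first_pos <- vapply(
  strsplit(input, " ", fixed = TRUE),
  function(words) grep("[0-9]", words)[1],
  integer(1)
)
first_pos
```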

Data:

input <- c("Hello I'd like to extract where the first 1010 is in this string",
           "80111 is in this string",
           "extract where the first 97865 is in this string")

Here is a base R solution that uses rapply() with grep() to recurse through the results of strsplit(); it works on a vector of strings.

Note: to split the strings on any whitespace rather than on a literal space, replace " ", fixed = TRUE with "\\s+", fixed = FALSE (the default).

rapply(strsplit(strings, " ", fixed = TRUE), function(x) grep("[0-9]+", x))
[1] 9 1 5
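
The whitespace variant described in the note could look roughly like this. The inputs are made up to include repeated spaces and a tab, and the sketch assumes the strings have no leading whitespace (which would produce an empty first element and shift every position by one).

```r
# Made-up inputs with irregular whitespace
strings <- c("extract where the   first  97865 is in this string",
             "80111\tis in this string")

# "\\s+" is a regular expression, so fixed must stay FALSE (the default)
positions <- rapply(strsplit(strings, "\\s+"), function(x) grep("[0-9]+", x))
positions
```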

Data:

strings = c("Hello I'd like to extract where the first 1010 is in this string", 
            "80111 is in this string", "extract where the first 97865 is in this string")

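Here is another approach. We can trim off everything after the first digit of the first number, and then find the position of the last remaining word. \b matches a word boundary, while \S+ matches one or more non-whitespace characters.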

first_numeric_word <- function(x) {
  x <- substr(x, 1L, regexpr("\\b\\d+\\b", x))
  lengths(gregexpr("\\b\\S+\\b", x))
}

Output

> first_numeric_word(x)
[1] 9 1 5

Data

x <- c(
  "Hello I'd like to extract where  the first 1010 is in this string", 
  "80111 is in this string", 
  "extract where the   first  97865 is in this string"
)
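
To make the two steps inside first_numeric_word() easier to follow, here they are applied one at a time to a single string. This is only an illustrative walk-through; the intermediate variable names (start, trimmed, n_words) are mine.

```r
x <- "extract where the   first  97865 is in this string"

# Character index where the first standalone run of digits begins
start <- regexpr("\\b\\d+\\b", x)

# Keep everything up to and including the first digit of that number
trimmed <- substr(x, 1L, start)
trimmed

# Count the words that are left; the last one is the truncated number,
# so the count is the word position of the first number
n_words <- lengths(gregexpr("\\b\\S+\\b", trimmed))
n_words
```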

Here is a fully tidyverse approach:

library(purrr)
library(stringr)

map_dbl(str_split(strings, " "), str_which, "\\d+")
#> [1] 9 1 5

map_dbl(str_split(strings[1], " "), str_which, "\\d+")
#> [1] 9

Note that it works for a single string as well as for several.


where strings is:

strings <- c("Hello I'd like to extract where the first 1010 is in this string",
             "80111 is in this string",
             "extract where the first 97865 is in this string")