Find the maximum number in an R dataframe column of strings

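For each cell in a particular column of a dataframe (which we'll simply call df here), I want to find the maximum and minimum numbers embedded in the string; the numbers are initially represented as strings. Any commas present in a cell have no special significance. The numbers shouldn't be percentages, so if, for example, 50% appears, then 50 should be excluded from consideration. The relevant column of the dataframe looks something like this: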

| particular_col_name | 
| ------------------- | 
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|

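Two new columns should therefore be created, titled 'maximum_number' and 'minimum_number'. For the first row they should contain 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Likewise, the second row should put 40 and 4 into the new columns.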

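I've tried several approaches (e.g. str_extract_all, regmatches, strsplit) inside dplyr's 'mutate', but they either give error messages (particularly regarding the input column particular_col_name) or don't return the data in a format from which the maximum and minimum values can easily be identified.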

Any help would be appreciated.

library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
    }),
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#>   particular_col_name                                                  min   max
#>   <chr>                                                              <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
#> 3 20% 30%                                                               NA    NA

Created on 2022-02-22 by the reprex package (v2.0.0)
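The warnings in the output above come from calling min() and max() on an empty vector for the row that contains only percentages. A minimal sketch of one way to avoid them, assuming the same remove-then-extract approach, is to check the length before taking the minimum or maximum. The helper name `stat_or_na` is hypothetical and is not part of the answer above.

```r
library(stringr)
library(purrr)

strings <- c(
  "First Row String10. This is also a string_5, and so is this 20, exclude70%",
  "Second_Row_50%, number40. Number 4. number_15",
  "20% 30%"
)

# Hypothetical helper (not part of the answer above): apply f to x,
# or return NA when x is empty, so min()/max() never see a zero-length vector.
stat_or_na <- function(x, f) if (length(x) == 0) NA_real_ else f(x)

# Drop percentages first, then pull out the remaining digit runs per string.
numbers <- map(
  str_extract_all(str_remove_all(strings, "[0-9]+%"), "[0-9]+"),
  as.numeric
)

result <- data.frame(
  min = map_dbl(numbers, ~ stat_or_na(.x, min)),
  max = map_dbl(numbers, ~ stat_or_na(.x, max))
)
print(result)
```

This should give the same min and max values as the reprex, with NA for the third row and without the warnings, so the na_if(Inf) and na_if(-Inf) steps are no longer needed.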

We could use str_extract_all combined with sapply:

library(stringr)

df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) max(as.integer(x)))
  particular_col_name                                                          min   max
  <chr>                                                                      <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    70
2 Second_Row_50%, number40. Number 4. number_15                                  4    50
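Note that the output above still counts 70 and 50, which the question asks to exclude because they are percentages. One possible adjustment, sketched here in base R with regmatches and gregexpr instead of stringr, is a negative lookahead that rejects digit runs followed by a percent sign. The pattern and the helper name `int_stat_or_na` are assumptions for illustration, not part of the original answer.

```r
strings <- c(
  "First Row String10. This is also a string_5, and so is this 20, exclude70%",
  "Second_Row_50%, number40. Number 4. number_15"
)

# Match a run of digits that is not followed by another digit or a percent sign.
# Including the digit in the lookahead stops the regex from backtracking
# and matching just the "7" of "70%".
pattern <- "[0-9]+(?![0-9%])"
matches <- regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
numbers <- lapply(matches, as.integer)

# Hypothetical helper: apply f (min or max), returning NA for rows with no numbers.
int_stat_or_na <- function(x, f) if (length(x) == 0) NA_integer_ else f(x)

result <- data.frame(
  min = vapply(numbers, function(x) int_stat_or_na(x, min), integer(1)),
  max = vapply(numbers, function(x) int_stat_or_na(x, max), integer(1))
)
print(result)
```

For the two sample rows this yields 5 and 20, then 4 and 40, which matches what the question expects.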

Data:

df <- structure(list(particular_col_name = c("First Row String10. This is also a string_5, and so is this 20, exclude70%", 
"Second_Row_50%, number40. Number 4. number_15"), min = 5:4, 
    max = c(70L, 50L)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))