Find the maximum number in an R dataframe column of strings
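For each cell in a particular column of a dataframe (which we will simply call df here), I want to find the maximum and minimum of the numbers embedded in the cell's string. Any commas in the cell have no special significance. The numbers must not be percentages, so if, for example, 50% appears, then 50 should be excluded. The relevant column of the dataframe looks like this: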
| particular_col_name |
| ------------------- |
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15 |
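Two new columns should therefore be created, titled 'maximum_number' and 'minimum_number'. For the first row they should contain 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Likewise, the second row should put 40 and 4 into the new columns.
I have tried several approaches inside dplyr's 'mutate' (e.g. str_extract_all, regmatches, strsplit), but they either give error messages (especially regarding the input column particular_col_name) or do not output the data in a format from which the maximum and minimum values can readily be identified.
Any help would be appreciated.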
library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    # Remove every "<digits>%" first, then extract the remaining runs of digits
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
    }),
    # min()/max() return Inf/-Inf for a row with no numbers; turn those into NA
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#>   particular_col_name                                                  min   max
#>   <chr>                                                              <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
#> 3 20% 30%                                                               NA    NA
Created on 2022-02-22 by the reprex package (v2.0.0)
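The two warnings in the output above come from calling min() and max() on an empty vector, which happens for the row that contains only percentages. A minimal sketch that checks for the empty case first, written in base R with regmatches (one of the functions the question mentions trying). The helper name `safe_range` is hypothetical and not part of the answer above:

```r
# Hypothetical helper: NA for an empty vector, otherwise its min and max.
safe_range <- function(x) {
  if (length(x) == 0) {
    c(min = NA_real_, max = NA_real_)
  } else {
    c(min = min(x), max = max(x))
  }
}

df <- data.frame(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  ),
  stringsAsFactors = FALSE
)

# Drop "<digits>%" first, then pull out the remaining runs of digits.
no_pct <- gsub("[0-9]+%", "", df$particular_col_name)
numbers <- lapply(regmatches(no_pct, gregexpr("[0-9]+", no_pct)), as.numeric)

# vapply() returns a 2 x n matrix whose rows are named "min" and "max".
ranges <- vapply(numbers, safe_range, c(min = 0, max = 0))
df$min <- ranges["min", ]
df$max <- ranges["max", ]

df[, c("min", "max")]
```

This gives min 5, 4, NA and max 20, 40, NA, the same values as the tibble above, without triggering the warnings.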
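We can use str_extract_all together with sapply:
library(stringr)
# Extract every run of digits from each string, then take the per-row min and max.
# Note: "[0-9]+" matches all digit runs, including those followed by "%".
df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) max(as.integer(x)))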
  particular_col_name                                                          min   max
  <chr>                                                                      <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    70
2 Second_Row_50%, number40. Number 4. number_15                                  4    50
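The output above still reports 70 and 50 as maxima, because the pattern "[0-9]+" also matches digits that are followed by a % sign. One possible way to skip those, sketched here as an assumption rather than as part of the original answer, is a negative lookahead in the pattern (stringr's regex engine supports lookaheads). The variable name `not_pct` is illustrative:

```r
library(stringr)

df <- data.frame(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15"
  ),
  stringsAsFactors = FALSE
)

# "(?![0-9]*%)" is a negative lookahead: a run of digits is rejected when the
# digits that follow it (if any) lead straight into a "%". Including "[0-9]*"
# stops the engine from accepting just the "7" of "70%".
not_pct <- "[0-9]+(?![0-9]*%)"
numbers <- lapply(str_extract_all(df$particular_col_name, not_pct), as.integer)

df$min <- sapply(numbers, min)
df$max <- sapply(numbers, max)

df[, c("min", "max")]
```

With this pattern the two rows give 5 and 20, and 4 and 40. A row with no qualifying numbers would still make min() and max() warn and return Inf and -Inf, so that case needs separate handling.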
Data:
df <- structure(list(particular_col_name = c("First Row String10. This is also a string_5, and so is this 20, exclude70%",
"Second_Row_50%, number40. Number 4. number_15"), min = 5:4,
max = c(70L, 50L)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))