Fast way to convert a list of character vectors to a list of numeric vectors
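I want to quickly convert a list of character vectors to a list of numeric vectors, and I am looking for something faster than purrr::map(), lapply(), etc. The list of character vectors is the output of a stringr operation. I'm open to using Rcpp or R's internal C API. It's for the filesstrings package.
The C++ standard library provides stod(), defined in <string>, but it doesn't behave like as.numeric(): for example, it converts "12a" to the number 12, whereas I like that as.numeric() returns NA for that input. This is how I currently do it.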
nums_as_chars <- stringr::str_extract_all(c("a1b2", "c3d4e5", "xyz"), "\\d")
nums_as_chars
#> [[1]]
#> [1] "1" "2"
#>
#> [[2]]
#> [1] "3" "4" "5"
#>
#> [[3]]
#> character(0)
nums <- purrr::map(nums_as_chars, as.numeric)
nums
#> [[1]]
#> [1] 1 2
#>
#> [[2]]
#> [1] 3 4 5
#>
#> [[3]]
#> numeric(0)
Created on 2018-04-04 by the reprex package (v0.2.0).
You haven't provided anything suitable for a sensible benchmark. So, test this yourself:
relist(as.numeric(unlist(nums_as_chars)),
nums_as_chars)
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 3 4 5
#
#[[3]]
#numeric(0)
OK, so I created a more realistic example with 3000 strings to do some profiling, and tried everything I could think of.
character_vector <- rep(c("a1b2", "c3d4e5", "xyz"), 1000)
try_1 <- function(chars){
  extracted_numbers <- stringr::str_extract_all(chars, "\\d")
  lapply(extracted_numbers, as.numeric)
}
try_2 <- function(chars){
  extracted_numbers <- stringr::str_extract_all(chars, "\\d")
  purrr::map(extracted_numbers, as.numeric)
}
try_3 <- function(chars){
  extracted_numbers <- stringr::str_extract_all(chars, "\\d")
  relist(as.numeric(unlist(extracted_numbers)), extracted_numbers)
}
try_4 <- function(chars){
  convert_fun <- function(x){as.numeric(stringr::str_extract_all(x, "\\d")[[1]])}
  lapply(chars, convert_fun)
}
# if you don't need to keep the list ...
try_5 <- function(chars){
  extracted_numbers <- stringr::str_extract_all(chars, "\\d")
  suppressWarnings(as.numeric(unlist(extracted_numbers)))
}
microbenchmark::microbenchmark(try_1(character_vector),
                               try_2(character_vector),
                               try_3(character_vector),
                               try_4(character_vector),
                               try_5(character_vector))
#> Unit: milliseconds
#>                     expr        min         lq       mean     median         uq        max neval
#>  try_1(character_vector)   2.701769   2.866486   3.304917   3.005177   3.275250  10.569255   100
#>  try_2(character_vector)   3.936557   4.295735   4.872892   4.391737   4.726425  17.007687   100
#>  try_3(character_vector)  12.441844  13.317455  15.759840  14.250013  16.995679  49.983457   100
#>  try_4(character_vector) 183.180143 187.789907 191.298661 190.073565 193.012754 215.532544   100
#>  try_5(character_vector)   1.846848   1.964761   2.090801   2.026860   2.105214   4.396379   100
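Note that the units are milliseconds: for 3000 entries, lapply() takes about 3 milliseconds to produce the list of 3000 vectors. That does not seem unreasonable to me.
The purrr::map() solution is pretty close to lapply(), @roland's solution (try_3) takes longer, and my first idea (try_4) is a lot worse. If you don't care about the list structure (which I imagine you do), you can get down to about 2 milliseconds (try_5).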
“The C++ standard lib provides stod() defined in <string> but it doesn't behave like as.numeric(), for example it converts "12a" to the number 12, but I like how as.numeric() returns NA for that.”
This part is easy to solve: just use the second parameter of the function to verify that the whole input was consumed:
#include <Rcpp.h>
#include <string>

// [[Rcpp::export]]
Rcpp::NumericVector as_numeric(std::string const& str) {
  std::size_t pos;
  double value = std::stod(str, &pos);  // pos = number of characters consumed
  return Rcpp::NumericVector::create(pos == str.size() ? value : NA_REAL);
}
〉 as_numeric('12')
[1] 12
〉 as_numeric('12a')
[1] NA
… obviously this should be vectorised for performance.
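One detail worth spelling out: std::stod() throws std::invalid_argument when no conversion can be performed at all (for example on "xyz"), so the position check alone does not cover every input. The following is a minimal sketch in plain standard C++, not Rcpp. The names strict_to_double() and strict_to_doubles() are made up for illustration, NaN stands in for R's NA, and mapping out-of-range input to NaN is an assumed policy rather than a claim about what as.numeric() does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: parse the whole string as a double, or return NaN
// (a stand-in for R's NA) when the string is not entirely numeric.
double strict_to_double(const std::string& s) {
    const double na = std::nan("");
    try {
        std::size_t pos = 0;
        const double value = std::stod(s, &pos);
        // pos is the number of characters consumed; for "12a" it is 2.
        return pos == s.size() ? value : na;
    } catch (const std::invalid_argument&) {
        return na;  // nothing numeric at the start, e.g. "xyz" or ""
    } catch (const std::out_of_range&) {
        return na;  // value does not fit in a double (assumed policy)
    }
}

// Vectorised form: one output element per input string.
std::vector<double> strict_to_doubles(const std::vector<std::string>& strings) {
    std::vector<double> out;
    out.reserve(strings.size());
    for (const std::string& s : strings) {
        out.push_back(strict_to_double(s));
    }
    assert(out.size() == strings.size());
    return out;
}
```

In an Rcpp version the same try/catch would sit inside the loop over a CharacterVector, with NA_REAL in place of NaN.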
Here is a base R solution, which is faster on the small example but slower on @rmflight's extended benchmark:
mm <- function(chars){
lapply(strsplit(chars,"[a-z]"),function(x) as.numeric(x[x!=""]))
}
character_vector <- c("a1b2", "c3d4e5", "xyz")
mm(character_vector)
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 3 4 5
#
# [[3]]
# numeric(0)
microbenchmark::microbenchmark(try_1(character_vector),
try_2(character_vector),
try_3(character_vector),
try_4(character_vector),
try_5(character_vector),
mm(character_vector))
# Unit: microseconds
# expr min lq mean median uq max neval
# try_1(character_vector) 74.007 83.0690 96.18563 93.2630 102.1365 158.962 100
# try_2(character_vector) 375.694 409.4875 455.99043 444.7915 489.5345 853.335 100
# try_3(character_vector) 97.416 110.6315 128.83893 128.5670 141.9715 218.620 100
# try_4(character_vector) 211.823 229.7590 251.82474 244.6740 257.5115 412.696 100
# try_5(character_vector) 71.363 94.0180 121.96305 113.4635 141.4050 255.245 100
# mm(character_vector) 14.726 23.0330 62.70177 26.4315 29.2630 3652.345 100
character_vector <- rep(c("a1b2", "c3d4e5", "xyz"), 1000)
microbenchmark::microbenchmark(try_1(character_vector),
try_2(character_vector),
try_3(character_vector),
try_4(character_vector),
try_5(character_vector),
mm(character_vector))
# Unit: microseconds
# expr min lq mean median uq max neval
# try_1(character_vector) 3286.091 3466.764 4080.628 3579.2825 3836.604 30358.295 100
# try_2(character_vector) 5013.525 5290.859 6081.001 5526.0920 5878.942 32019.653 100
# try_3(character_vector) 18456.932 19466.395 22987.824 20040.6965 21318.998 65432.958 100
# try_4(character_vector) 215098.270 224078.287 245975.607 237691.9815 259353.446 356781.135 100
# try_5(character_vector) 906.951 1044.013 1106.771 1063.8365 1105.559 1649.276 100
# mm(character_vector) 7550.495 7872.383 9204.357 8215.6040 8733.268 36034.853 100
For information, here is another solution, though it is slower:
mm2 <- function(chars){
  lapply(gsub("[a-z]"," ",chars),function(x) scan(text=x,what=double(),quiet=TRUE))
}
mm2(character_vector)
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 3 4 5
#
# [[3]]
# numeric(0)
Inspired by @Konrad's answer, I wrote the following code using Rcpp.
#include <Rcpp.h>
#include <stdexcept>
#include <string>
using namespace Rcpp;

// Convert one character vector; anything that is not entirely numeric becomes NA.
NumericVector char_to_num(CharacterVector x) {
  std::size_t n = x.size();
  if (n == 0) return NumericVector(0);
  NumericVector out(n);
  for (std::size_t i = 0; i != n; ++i) {
    std::string x_i(x[i]);
    double number = NA_REAL;
    try {
      std::size_t pos;
      number = std::stod(x_i, &pos);
      // Accept the value only if the whole string was consumed.
      number = ((pos == x_i.size()) ? number : NA_REAL);
    } catch (const std::invalid_argument& e) {
      ; // do nothing: number stays NA_REAL
    }
    out[i] = number;
  }
  return out;
}

// [[Rcpp::export]]
List lst_char_to_num(List x) {
  std::size_t n = x.size();
  List out(n);
  for (std::size_t i = 0; i != n; ++i)
    out[i] = char_to_num(x[i]);
  return out;
}
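This lst_char_to_num() turns out to be the best answer. I compare it with my favourite answers so far, which are try_1, try_2 and try_3 from @rmflight. Until now, try_1 was the fastest on large datasets, which is the case I care about. I excluded the stringr operation from the timings because I wanted to evaluate purely the speed of the list conversion.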
character_vector <- rep(c("a1b2", "c3d4e5", "xyz"), 1000)
extracted_numbers <- stringr::str_extract_all(character_vector, "\\d")
try_1 <- function(char_list) {
  lapply(char_list, as.numeric)
}
try_2 <- function(char_list) {
  purrr::map(char_list, as.numeric)
}
try_3 <- function(char_list) {
  relist(as.numeric(unlist(char_list)), char_list)
}
microbenchmark::microbenchmark(try_1(extracted_numbers),
                               try_2(extracted_numbers),
                               try_3(extracted_numbers),
                               lst_char_to_num(extracted_numbers),
                               times = 1000)
Unit: microseconds
expr min lq mean median uq max neval cld
try_1(extracted_numbers) 1068.823 1334.9060 1518.7589 1477.7825 1559.791 5318.318 1000 b
try_2(extracted_numbers) 2029.832 2581.6655 2974.4126 2856.8560 3057.930 9846.862 1000 c
try_3(extracted_numbers) 10015.929 12261.6405 14043.5922 13188.8465 14802.795 165217.152 1000 d
lst_char_to_num(extracted_numbers) 500.858 681.5895 827.5021 765.9505 830.311 6744.985 1000 a
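The per-element conversion above relies on try/catch to handle strings that contain no number. As a possible variation, here is a minimal sketch of the same "whole string consumed" check built on std::strtod(), which reports failure through its end pointer instead of throwing. It is plain standard C++ rather than Rcpp, the name strtod_or_nan() is hypothetical, NaN stands in for NA_REAL, and whether it would actually be faster than the std::stod() version has not been measured here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <string>

// Hypothetical helper: exception-free strict parse built on std::strtod.
// Returns NaN (a stand-in for R's NA_REAL) unless the whole string is numeric.
double strtod_or_nan(const std::string& s) {
    const char* begin = s.c_str();
    char* end = nullptr;
    const double value = std::strtod(begin, &end);
    assert(end != nullptr);  // strtod always sets the end pointer
    // end == begin          : nothing was parsed ("xyz", "")
    // end before string end : trailing characters remain ("12a")
    const bool consumed_all = (end != begin) && (end == begin + s.size());
    return consumed_all ? value : std::nan("");
}
```

Unlike std::stod(), std::strtod() does not throw on overflow; it returns HUGE_VAL with the appropriate sign, so an input such as "1e999" comes back as infinity here.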