Replacing '\\' or '\\\\' in a Spark dataframe via sparklyr fails
I am trying to replace backslashes in a Spark dataframe. I wrote a function that works on a regular R data frame and plugged it into spark_apply, but it doesn't work:
rm(back_slash_replace_func)
back_slash_replace_func <- function(x)
{
cbind.data.frame(
lapply(
x, function(x) { if (class(x) == 'character') { gsub(pattern = "\\", replacement = "/", x = x, fixed = T) } else { x } }
)
, stringsAsFactors = F
)
}
## do in R
x <- data.frame(x = rep('\\', 10), stringsAsFactors = F)
back_slash_replace_func(x)
## do in spark
library(sparklyr)
r_spark_connection <- spark_connect(master = "local")
xsp <- copy_to(r_spark_connection, x, overwrite = T)
start <- Sys.time()
spark_apply(
x = xsp
, f = back_slash_replace_func
, memory = F
)
Sys.time() - start
It doesn't do the job: no error, no warning. What could be going on?
The first thing you should notice is that copy_to malforms your data. While x is:
x %>% head(1)
# x
# 1 \
xsp is
xsp %>% head(1)
# # Source: lazy query [?? x 1]
# # Database: spark_connection
# x
# <chr>
# 1 "\""
That's because with copy_to, sparklyr dumps the data to flat files. As a result it doesn't even work locally:
xsp %>% collect %>% back_slash_replace_func %>% head(1)
# x
# 1 "
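You can confirm the corruption programmatically instead of eyeballing printed output by comparing the collected column with the original. A minimal check using dplyr's pull, assuming the x and xsp objects from the question:

library(dplyr)

identical(x$x, xsp %>% collect() %>% pull(x))
# FALSE: the flat-file round trip has malformed the values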
If you create the data frame directly:
df <- spark_session(r_spark_connection) %>%
    invoke("sql", "SELECT '\\\\' AS x FROM range(10)") %>%
    sdf_register()
df %>% collect %>% back_slash_replace_func %>% head(1)
# x
# 1 /
this particular problem won't occur.
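As an aside, newer sparklyr releases also expose sdf_sql, which wraps the same session-invoke-register pattern in a single call (a sketch, assuming your sparklyr version provides it):

df <- sdf_sql(r_spark_connection, "SELECT '\\\\' AS x FROM range(10)")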
Another problem here is that spark_apply actually converts strings to factors (as per Kevin's comment this is tracked by sparklyr:1295), so instead of:
function(x) {
if (is.character(x)) {
gsub(pattern = "\\", replacement = "/", x = x, fixed = T)
} else { x }
}
you would rather need:
function(x) {
if (is.factor(x)) {
gsub(pattern = "\\", replacement = "/", x = as.character(x), fixed = T)
} else { x }
}
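Wired into the original wrapper, the factor-aware variant would look like this (a sketch; back_slash_replace_func_factor is a name I'm introducing, reusing the structure from the question, run against the df built above):

back_slash_replace_func_factor <- function(x)
{
    cbind.data.frame(
        lapply(
            x, function(col) {
                # spark_apply hands columns over as factors, so convert first
                if (is.factor(col)) {
                    gsub(pattern = "\\", replacement = "/", x = as.character(col), fixed = TRUE)
                } else { col }
            }
        )
        , stringsAsFactors = FALSE
    )
}

spark_apply(x = df, f = back_slash_replace_func_factor, memory = FALSE)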
But in practice, just use translate:
df %>% mutate(x = translate(x, "\\\\", "/")) %>% head(1)
# # Source: lazy query [?? x 1]
# # Database: spark_connection
# x
# <chr>
# 1 /
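Note that translate does literal character-for-character substitution, which is all that's needed here. If you really wanted a regular expression, regexp_replace is also passed through to Spark SQL, but then the backslash has to survive both Spark's string-literal unescaping and regex escaping, hence eight backslashes in the R source (a sketch, assuming default Spark SQL string-literal parsing):

df %>% mutate(x = regexp_replace(x, "\\\\\\\\", "/")) %>% head(1)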