R: Read csv numeric with comma in decimal, package sparklyr
I need to read a ".csv" file in which the numeric values use a comma as the decimal separator, using the "sparklyr" package. The idea is to be able to read it directly with "spark_read_csv()".
I am using:
library(sparklyr)
library(dplyr)
f <- data.frame(DNI = c("22-e", "EE-4", "55-W"),
                DD = c("33,2", "33.2", "14,55"), CC = c("2", "44,4", "44,9"))
write.csv(f, "aff.csv", row.names = FALSE)  # row.names = FALSE keeps the file to the three data columns
sc <- spark_connect(master = "local", spark_home = "/home/tomas/spark-2.1.0-bin-hadoop2.7/", version = "2.1.0")
df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",")
tbl <- sdf_copy_to(sc = sc, x = df, overwrite = T)
The problem: it reads the numbers in as factors.
You can replace the "," in the numbers with "." and convert them to numeric, for example:
df$DD <- as.numeric(gsub(pattern = ",", replacement = ".", x = df$DD))
Does that help?
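Note that gsub() is base R, so it operates on a local data frame rather than on the Spark table itself. A minimal sketch of that workflow, assuming the data is first collected from Spark into local R memory (tbl is the Spark table from the question):
local_df <- collect(tbl)   # pull the Spark table into a local R data frame
local_df$DD <- as.numeric(gsub(pattern = ",", replacement = ".", x = local_df$DD))
local_df$CC <- as.numeric(gsub(pattern = ",", replacement = ".", x = local_df$CC))
str(local_df)              # DD and CC are now numeric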
To manipulate strings in a Spark df you can use the regexp_replace function, as mentioned here:
https://spark.rstudio.com/guides/textmining/
For your problem it would be solved like this:
tbl <- sdf_copy_to(sc = sc, x = df, overwrite = T)
tbl0 <- tbl %>%
  mutate(DD = regexp_replace(DD, ",", "."), CC = regexp_replace(CC, ",", ".")) %>%
  mutate_at(vars(c("DD", "CC")), as.numeric)
Check the result:
> glimpse(tbl0)
Observations: ??
Variables: 3
$ DNI <chr> "22-e", "EE-4", "55-W"
$ DD <dbl> 33.20, 33.20, 14.55
$ CC <dbl> 2.0, 44.4, 44.9
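Once the columns are numeric, the usual dplyr verbs that sparklyr translates to Spark SQL work on them. As a quick sanity check (a minimal sketch; mean() is translated to Spark's AVG by the dplyr SQL backend):
tbl0 %>%
  summarise(mean_DD = mean(DD), mean_CC = mean(CC))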
If you don't want to replace the commas with ".", maybe you can try this instead.
Check the documentation: use the escape parameter to specify the character you want ignored.
In this case, try:
df <- spark_read_csv(sc, name = "data", path = "/home/tomas/Documentos/Clusterapp/aff.csv", header = TRUE, delimiter = ",", escape = ",")