How to use the substr() function on a column in SparkR

Question

How can I use the substr() function on a DataFrame column in SparkR?

+----------+----------------+-----------+
|   cust_id|  tran_datetime |Total_trans|
+----------+----------------+-----------+
|CQ98901297|2015-06-06 09:00|          1|
|CQ98901297|2015-05-01 09:25|          1|
|CQ98901297|2015-05-02 10:45|          1|
|CQ98901297|2015-05-03 11:01|          1|

I need to trim the time off the values in the tran_datetime column.

Answer 1

# Use substr(column, start, stop) inside select() to keep characters 1 through 10
df_new <- select(df, df$cust_id, substr(df$tran_datetime, 1, 10), df$Total_trans)
# The substr() column gets an auto-generated name in df_new, so use rename() to give it the desired name
df_new <- rename(df_new, date = df_new[[2]])

showDF(df_new)

+----------+----------+-----------+
|   cust_id|  date    |Total_trans|
+----------+----------+-----------+
|CQ98901297|2015-06-06|          1|
|CQ98901297|2015-05-01|          1|
|CQ98901297|2015-05-02|          1|
|CQ98901297|2015-05-03|          1|
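
The rename() step can also be avoided by naming the substring column when it is created. The following is a minimal sketch, not the answerer's code: it assumes Spark 2.0 or later (for sparkR.session()) and a local Spark installation, and the local_df sample data is made up to mirror the question.

```r
library(SparkR)

# Assumption: a local Spark 2.x+ installation is available
sparkR.session(master = "local[1]")

# Hypothetical sample data shaped like the table in the question
local_df <- data.frame(
  cust_id = c("CQ98901297", "CQ98901297"),
  tran_datetime = c("2015-06-06 09:00", "2015-05-01 09:25"),
  Total_trans = c(1L, 1L),
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# alias() names the substr() column directly, so no rename() is needed
df_new <- select(
  df,
  df$cust_id,
  alias(substr(df$tran_datetime, 1, 10), "date"),
  df$Total_trans
)

# Bring the small result back to the driver as a plain R data.frame
result <- collect(df_new)
print(result)

sparkR.session.stop()
```

If the column is later needed as a real date rather than a string, to_date() on the original column may be a better fit than substr().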

Answer 2

I think the best solution is to apply strsplit().

x <- data.frame(lin=c('+----------+----------------+-----------+',
                      '|   cust_id|  tran_datetime |Total_trans|',
                      '+----------+----------------+-----------+',
                      '|CQ98901297|2015-06-06 09:00|          1|',
                      '|CQ98901297|2015-05-01 09:25|          1|',
                      '|CQ98901297|2015-05-02 10:45|          1|'),
                id = 1:6,
                stringsAsFactors = FALSE)
# remove the border lines that start with "+"
x <- x[substr(x$lin, 1, 1) != "+", ]
# split each line into columns on the pipe separator (escaped, since "|" is a regex metacharacter)
y <- strsplit(x$lin, split = "\\|")
# trim whitespace on both sides after the split
library(stringr)
y <- lapply(y, function(x) str_trim(x, "both"))
# [, -1] because the first column is empty
y <- do.call(rbind, y)[, -1]
colnames(y) <- y[1, ]
y <- data.frame(y[-1, ], stringsAsFactors = FALSE)
y
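
The code above rebuilds the table as a local data.frame but stops before removing the time. A minimal base R sketch of that last step follows. It assumes a local data.frame shaped like y; the name parsed and its contents are made up for illustration, and this runs in plain R rather than on a SparkR DataFrame.

```r
# Hypothetical stand-in for the parsed data.frame `y` (all columns are character)
parsed <- data.frame(
  cust_id = c("CQ98901297", "CQ98901297", "CQ98901297"),
  tran_datetime = c("2015-06-06 09:00", "2015-05-01 09:25", "2015-05-02 10:45"),
  Total_trans = c("1", "1", "1"),
  stringsAsFactors = FALSE
)

# Keep characters 1 through 10 of each timestamp, i.e. the "YYYY-MM-DD" part
parsed$date <- substr(parsed$tran_datetime, 1, 10)

# Drop the original column and restore the column order from the question
parsed <- parsed[, c("cust_id", "date", "Total_trans")]
print(parsed)
```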

apache-spark

sparkr