How to remove '\' from a string in sparklyr

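I am using sparklyr and have a Spark data frame whose column word contains words, some of which contain special characters that I want to remove. Putting \\ before the special character in regexp_replace worked, like this: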

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\(', '')) %>% 
  mutate(word = regexp_replace(word, '\\)', '')) %>% 
  mutate(word = regexp_replace(word, '\\+', '')) %>% 
  mutate(word = regexp_replace(word, '\\?', '')) %>%
  mutate(word = regexp_replace(word, '\\:', '')) %>%
  mutate(word = regexp_replace(word, '\\;', '')) %>%
  mutate(word = regexp_replace(word, '\\!', ''))

Now I want to remove \. I have tried both:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\\', ''))

and:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\', ''))

but neither works...

You have to account for escaping on both the R side and the Java side, so what you actually need is "\\\\":

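# sc is assumed to be an existing Spark connection (e.g. from spark_connect()).
# The backslash in the sample data must itself be escaped in the R literal.
df <- copy_to(sc, tibble(word = "(abc\\zyx: 1)"))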

df %>% mutate(regexp_replace(word, "\\\\", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "\\\\\\\\", "")`
  <chr>           <chr>
1 "(abc\\zyx: 1)" (abczyx: 1)
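
The two layers of escaping can be checked locally without a Spark connection. The sketch below is only an illustration: it uses base R's gsub as a stand-in for Spark's regexp_replace (an assumption; the regex engines differ, but both need an escaped backslash to match a literal one).

```r
# The R source "\\\\" is parsed by R into a two-character string: \\
pattern <- "\\\\"
pattern_length <- nchar(pattern)

# The regex engine then reads those two characters as one escaped,
# literal backslash.
word <- "(abc\\zyx: 1)"            # contains a single backslash
cleaned <- gsub(pattern, "", word) # base R stand-in for regexp_replace

writeLines(pattern)  # the string the regex engine receives
writeLines(word)     # the raw data value
writeLines(cleaned)  # the value with the backslash removed
```

This also explains why the two attempts in the question fail before they ever reach Spark: in both '\\\' and '\', the last backslash escapes the closing quote, so R never sees the end of the string.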

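Depending on your exact requirements, it may be easier to match all characters at once. For example, you could keep only word characters (\w) and whitespace (\s):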

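# Inside [...] a "+" is a literal plus sign, so it is left out of the class
# here; otherwise "+" characters would be kept.
df %>% mutate(regexp_replace(word, "[^\\w\\s]", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "[^\\\\w\\\\s]", "")`
  <chr>           <chr>
1 "(abc\\zyx: 1)" abczyx 1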

or only word characters:

# As above, "+" is dropped from the class so that plus signs are removed too.
df %>% mutate(regexp_replace(word, "[^\\w]", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "[^\\\\w]", "")`
  <chr>           <chr>
1 "(abc\\zyx: 1)" abczyx1
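
If only the specific characters from the question should go, an explicit character class is another option. The following is a minimal sketch in base R, not sparklyr code: the helper names strip_listed and strip_non_word_space are hypothetical, and perl = TRUE is used on the assumption that PCRE treats these simple classes the same way Java's regex engine does.

```r
# Hypothetical helpers for illustration; they are not part of sparklyr.

# Remove only the characters listed in the question, plus the backslash.
strip_listed <- function(x) gsub("[()+?:;!\\\\]", "", x, perl = TRUE)

# Remove everything that is neither a word character nor whitespace.
strip_non_word_space <- function(x) gsub("[^\\w\\s]", "", x, perl = TRUE)

sample_word <- "(abc\\zyx: 1+2-3)?!"   # contains a single backslash

listed_result <- strip_listed(sample_word)         # "-" is not listed, so it stays
class_result <- strip_non_word_space(sample_word)  # "-" is removed as well

writeLines(listed_result)
writeLines(class_result)
```

In sparklyr, the same pattern strings would presumably be passed as the second argument of regexp_replace inside mutate; that has not been verified against a Spark cluster here. The explicit list is the safer choice when characters such as hyphens or apostrophes must survive.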