How to remove '\' from a string in sparklyr

Question

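I am using sparklyr and have a Spark data frame whose column word contains words, some of which include special characters that I want to remove. Using regexp_replace with \\ in front of the special character works, like this: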

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\(', '')) %>% 
  mutate(word = regexp_replace(word, '\\)', '')) %>% 
  mutate(word = regexp_replace(word, '\\+', '')) %>% 
  mutate(word = regexp_replace(word, '\\?', '')) %>%
  mutate(word = regexp_replace(word, '\\:', '')) %>%
  mutate(word = regexp_replace(word, '\\;', '')) %>%
  mutate(word = regexp_replace(word, '\\!', ''))

Now I want to remove \. I have tried both:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\\', ''))

and:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\', ''))

but neither works...

Answer 1

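You have to correct your code for escaping on both the R side and the Java side: each layer consumes one level of backslashes, so what you actually need is "\\\\":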

# sc is an existing Spark connection; "\\" in R source is a single literal backslash
df <- copy_to(sc, tibble(word = "(abc\\zyx: 1)"))

df %>% mutate(regexp_replace(word, "\\\\", ""))

# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "\\\\\\\\", "")`
  <chr>           <chr>                                         
1 "(abc\\zyx: 1)" (abczyx: 1)
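
The two escaping levels can be checked locally without Spark. This is only a sketch: it uses base R's gsub with perl = TRUE as a stand-in for the Java regex engine, on the assumption that both treat this particular pattern the same way. The names pattern, x and cleaned are illustrative.

```r
# Level 1: R's parser turns "\\\\" in source code into a 2-character string.
pattern <- "\\\\"
stopifnot(nchar(pattern) == 2)

# The sample value: "\\" in source is one literal backslash in the string.
x <- "(abc\\zyx: 1)"

# Level 2: the regex engine reads the two backslashes as one literal backslash.
cleaned <- gsub(pattern, "", x, perl = TRUE)
print(cleaned)
```

This also explains why the attempt with '\\\' in the question cannot work: the third backslash escapes the closing quote, so R never sees a complete string.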

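Depending on your exact requirements, it may be easier to match all unwanted characters at once. For example, you can keep only word characters (\w) and whitespace (\s):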

df %>% mutate(regexp_replace(word, "[^\\w\\s]", ""))

# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "[^\\\\w\\\\s]", "")`
  <chr>           <chr>                                                
1 "(abc\\zyx: 1)" abczyx 1

or word characters only:

df %>% mutate(regexp_replace(word, "[^\\w]", ""))

# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "[^\\\\w]", "")`
  <chr>           <chr>                                      
1 "(abc\\zyx: 1)" abczyx1
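
If only the specific characters from the question should go, the chain of mutate calls could probably be collapsed into a single character class that also contains the backslash. The sketch below tests the pattern locally with gsub and perl = TRUE, again assuming the Perl-compatible engine agrees with Java's for this pattern. The variable names and the sample string are made up for illustration.

```r
# Every character from the question in one class; "\\\\" contributes the backslash.
special_chars <- "[()+?:;!\\\\]"

# Hypothetical sample containing each of the unwanted characters.
x <- "(abc\\zyx: 1)+?;!"

cleaned <- gsub(special_chars, "", x, perl = TRUE)
print(cleaned)

# With sparklyr, assuming words.sdf from the question, the same literal would be:
# words.sdf %>% mutate(word = regexp_replace(word, "[()+?:;!\\\\]", ""))
```

Inside a character class the other metacharacters need no escaping, so the backslash is the only one that has to be doubled twice.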
