Alternative for ``stringr::str_detect`` when working in Spark

Question

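I have been working with RStudio on my local machine for a few years and recently started using Spark (version 3.0.1). When I tried to run stringr::str_detect() in Spark, I ran into an unexpected problem: apparently str_detect() has no equivalent in SQL. I am looking for an alternative, preferably in R.

Here is an example of the result I expect when running str_detect() locally, compared with what happens in Spark.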

# Load packages
library(dplyr)
library(stringr)
library(sparklyr)

# Example tibble
df <- tibble(foodtype = c("potatosalad", "potato", "salad"))
df

---
# A tibble: 3 x 1
  foodtype   
  <chr>      
1 potatosalad
2 potato     
3 salad 
---

# Expected result when using R
df %>% 
  mutate(contains_potato = str_detect(foodtype, "potato"))

---
# A tibble: 3 x 2
  foodtype    contains_potato
  <chr>       <lgl>          
1 potatosalad TRUE           
2 potato      TRUE           
3 salad       FALSE  
---

But when I run this code on a Spark DataFrame, it returns the following error message: "Error: str_detect() is not available in this SQL variant".

# Connect to local Spark cluster
sc <- spark_connect(master = "local", version = "3.0")

# Copy tibble to Spark cluster
df_spark <- copy_to(sc, df)
df_spark

# Error when using str_detect with Spark
df_spark %>% 
  mutate(contains_potato = str_detect(foodtype, "potato"))

---
Error: str_detect() is not available in this SQL variant
---

Answer 1

str_detect() is equivalent to Spark's rlike function. I don't use Spark with R, but something like this should work:

df_spark %>% mutate(contains_potato = foodtype %rlike% "potato")

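When a function has no dplyr equivalent, dplyr accepts Spark functions written as ordinary R function calls and passes them through to Spark SQL: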

df_spark %>% mutate(contains_potato = rlike(foodtype, "potato"))
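To check what actually gets sent to Spark, something along these lines may help. It is only a sketch: it assumes a local Spark 3.0 installation as in the question, and the table name "foods" is a hypothetical choice of mine. show_query() prints the SQL that dplyr generates, so you can see that rlike is passed through untranslated.

```r
library(dplyr)
library(sparklyr)

# Assumes a local Spark 3.0 installation, as in the question
sc <- spark_connect(master = "local", version = "3.0")

# Copy the example data to Spark; "foods" is a hypothetical table name
df <- tibble(foodtype = c("potatosalad", "potato", "salad"))
df_spark <- copy_to(sc, df, name = "foods", overwrite = TRUE)

# rlike() has no dplyr translation, so it is handed to Spark SQL as written
query <- df_spark %>%
  mutate(contains_potato = rlike(foodtype, "potato"))

# Print the generated SQL to confirm the pass-through
show_query(query)

# Bring the result back into R as a local tibble
result <- collect(query)
result

spark_disconnect(sc)
```

Note that rlike, like str_detect(), treats the pattern as a regular expression, so special characters in the pattern need escaping in both cases.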


Tags: r, stringr, apache-spark, sparklyr