Sparklyr Spark SQL Filter based on multiple wildcards
Using sparklyr, I'm trying to find a way to filter a Spark data frame that combines the functionality of rlike and %in%. Here is a minimal working example:
# start a Spark session in R and have dplyr loaded
# create a Spark data frame
df <- data.frame(names = c("Brandon", "Chee", "Brandi", "Firouz", "Eric", "Erin"),
                 place = c("Pasadena", "South Bay", "West Hollywood", "SF Valley", "South Bay", "South Bay"))
sc_df <- sdf_copy_to(sc, df, overwrite = TRUE)
# set wildcard filter parameters
f_params <- c("Brand", "Er")
# return all rows of sc_df where the 'names' value contains any of the 'f_params' values
df_filtered <- sc_df %>%
  filter(rlike(names, f_params)) %>%
  collect()
The df_filtered in the code above obviously fails. Ideally, the df_filtered table would look like:
print(df_filtered)
# names    place
# Brandon  Pasadena
# Brandi   West Hollywood
# Eric     South Bay
# Erin     South Bay
Additional rule: because the real example includes about 200 values in f_params, I can't use the following solution:
df_filtered <- sc_df %>%
  filter(rlike(names, "Brand") | rlike(names, "Er")) %>%
  collect()
Thanks in advance.
I can't use multiple rlike() statements separated with | (OR) because the real example includes about 200 values in f_params
That sounds like a rather artificial constraint, but if you really want to avoid a single regular expression, you can always build an explicit disjunction:
library(rlang)

sc_df %>%
  filter(!!rlang::parse_quo(glue::glue_collapse(glue::glue(
    "(names %rlike% '{f_params}')"),
    " %or% "  # or " | "
  ), rlang::caller_env()))
# Source: spark<?> [?? x 2]
  names   place
  <chr>   <chr>
1 Brandon Pasadena
2 Brandi  West Hollywood
3 Eric    South Bay
4 Erin    South Bay
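To see what is actually spliced into filter(), you can evaluate the glued expression on its own. A quick sketch, using the two-element f_params from the question; it expands to a chain of %rlike% tests joined by %or% (output shown as a comment):

glue::glue_collapse(glue::glue("(names %rlike% '{f_params}')"), " %or% ")
# (names %rlike% 'Brand') %or% (names %rlike% 'Er')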
If f_params are guaranteed to be valid regexp literals, it should be much faster to simply concatenate the strings:
sc_df %>%
  filter(names %rlike% glue::glue_collapse(glue::glue("{f_params}"), "|"))
# Source: spark<?> [?? x 2]
  names   place
  <chr>   <chr>
1 Brandon Pasadena
2 Brandi  West Hollywood
3 Eric    South Bay
4 Erin    South Bay
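Here the collapsed pattern is just a single alternation, Brand|Er. Base R's paste() builds the same string if you'd rather skip the glue dependency; a small equivalent sketch:

glue::glue_collapse(glue::glue("{f_params}"), "|")
# Brand|Er

paste(f_params, collapse = "|")  # base R equivalent
# [1] "Brand|Er"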
If not, you can try applying Hmisc::escapeRegex first:
sc_df %>%
  filter(
    names %rlike% glue::glue_collapse(glue::glue(
      "{Hmisc::escapeRegex(f_params)}"
    ), "|")
  )
But keep in mind that Spark uses Java regular expressions, so it might not cover some edge cases.
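To illustrate what the escaping step does, here is a short sketch with hypothetical patterns that contain regex metacharacters; Hmisc::escapeRegex backslash-escapes them so they match literally rather than as regex syntax:

# hypothetical patterns containing regex metacharacters (illustration only)
p <- c("St. Mary", "A+B")
Hmisc::escapeRegex(p)
# [1] "St\\. Mary" "A\\+B"
glue::glue_collapse(Hmisc::escapeRegex(p), "|")
# St\. Mary|A\+B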