How do you remove ordering on a remote (PostgreSQL) table?

Calling dplyr::arrange() on a table in a remote source adds an 'Ordered by: ...' flag. Is there a follow-up function that removes this 'Ordered by:' flag from the remote table?

Consider the example data:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = TRUE)

For which:

glimpse(tmp_cars_sdf)
# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...

Consider the local case:

tmp_cars <-
    cars
tmp_cars <-
    tmp_cars %>%
    arrange(speed, dist)
glimpse(tmp_cars)

# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

But on the remote table:

tmp_cars <-
    tmp_cars_sdf %>%
    arrange(speed, dist)
glimpse(tmp_cars)

# Observations: ??
#     Variables: 2
# Database: postgres 9.5.3 
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...

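dbplyr tends to nest subqueries as you add commands. So as you add more commands, an earlier arrange may end up inside a subquery. That appears to be the underlying problem.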
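To see where the ORDER BY ends up in your own setup, it can help to look at the generated SQL without touching a database. A minimal sketch, assuming dbplyr's lazy_frame() with simulate_postgres(); the cars_lazy name and the ratio column are made up for illustration, and the exact nesting you see will depend on your dbplyr version:

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A lazy table on a simulated PostgreSQL connection: no real database is needed.
# (cars_lazy is a hypothetical stand-in for tmp_cars_sdf.)
cars_lazy <- lazy_frame(speed = 4, dist = 2, con = simulate_postgres())

# SQL for a plain arrange(): the ordering shows up as an ORDER BY clause.
arranged_sql <- cars_lazy %>%
    arrange(speed, dist) %>%
    sql_render() %>%
    as.character()
cat(arranged_sql, "\n")

# Adding further verbs after arrange() lets you inspect how the query
# (and the earlier ordering) gets restructured.
cars_lazy %>%
    arrange(speed, dist) %>%
    mutate(ratio = dist / speed) %>%
    filter(ratio > 2) %>%
    show_query()
```

Running the same pipeline with show_query() against the real connection should tell you whether the earlier ordering is still present in the final query.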

One option for removing these is to render the underlying SQL query and edit it directly. Perhaps something like the following:

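unarrange = function(table, cols_prev_ordered_by){

  db_connection = table$src$con

  # the ORDER BY clause as dbplyr renders it, e.g. ORDER BY "speed", "dist"
  order_text = paste0("ORDER BY \"",
                      paste0(cols_prev_ordered_by, collapse = "\", \""),
                      "\"")

  query_text = table %>% sql_render() %>% as.character()
  # fixed = TRUE: match the clause literally, not as a regular expression
  new_query_text = gsub(order_text, "", query_text, fixed = TRUE)

  # wrap in sql() so the edited text is treated as SQL, not as a quoted string
  sql_query = build_sql(con = db_connection, sql(new_query_text))
  return(tbl(db_connection, sql(sql_query)))
}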

# example:
tmp_cars <-
    tmp_cars_sdf %>%
    arrange(speed, dist) %>%
    unarrange(c("speed", "dist"))

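There are surely more robust ways than gsub to identify and remove the ordering part of the query. If this matters to you, you may want to look at ?select_query, since it has an explicit order_by argument.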
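As one slightly more forgiving variation on the text-editing idea, a regular expression can drop a trailing ORDER BY without needing the column names. This is only a sketch in base R: strip_trailing_order_by is a hypothetical helper, it only handles an ORDER BY at the very end of the outermost query, it leaves clauses containing parentheses alone, and it would also swallow a LIMIT that follows the ORDER BY, so check the result before relying on it.

```r
# Hypothetical helper: remove a trailing, top-level ORDER BY clause from SQL text.
# The clause must run to the end of the string and contain no parentheses,
# so an ORDER BY inside a subquery is left untouched.
strip_trailing_order_by <- function(query_text) {
    sub("\\s+ORDER\\s+BY\\s+[^()]*$", "", query_text,
        ignore.case = TRUE, perl = TRUE)
}

# A top-level ORDER BY is removed ...
top_level <- "SELECT *\nFROM \"tmp_cars_sdf\"\nORDER BY \"speed\", \"dist\""
cat(strip_trailing_order_by(top_level), "\n")

# ... while an ORDER BY nested in a subquery is kept as-is.
nested <- "SELECT * FROM (SELECT * FROM \"t\" ORDER BY \"speed\") \"q\""
cat(strip_trailing_order_by(nested), "\n")
```

The cleaned text can then be passed to tbl(con, sql(...)) in the same way as in the function above.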

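Inspired by Simon's answer and the comments on the OP, the following function is a workaround that removes all ordering (but keeps any new columns that were computed using the ordering). It is probably not the most efficient or low-level/direct approach, a point I return to at the end of this answer, but I will leave it to the dbplyr team to address my issue if they see fit.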

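unarrange <-
    function(remote_df) {

        # remember the grouping so it can be restored at the end
        existing_groups <- groups(remote_df)

        # materialise the current query as a temporary table
        remote_df <-
            remote_df %>%
            compute()

        # rebuild the tbl from the plain SELECT on that temporary table,
        # which drops the locally stored ordering (and grouping)
        remote_df <-
            tbl(remote_df$src$con,
                sql_render(remote_df))

        # restore the original grouping
        remote_df <-
            group_by(remote_df, !!!existing_groups)

        return(remote_df)

    }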

Why this works

Input data:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = TRUE)

Consider

str(tmp_cars_sdf)
# ..$ con <truncated>
# ..$ disco <truncated>
# $ ops:List of 2
# ..$ x   : 'ident' chr "tmp_cars_sdf"
# ..$ vars: chr [1:2] "speed" "dist"
# ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

versus

tmp_cars_sdf <-
    tmp_cars_sdf %>%
    arrange(speed, dist)

str(tmp_cars_sdf)
# $ ops:List of 4
# ..$ name: chr "arrange"
# ..$ x   :List of 2
# .. ..$ x   : 'ident' chr "tmp_cars_sdf"
# .. ..$ vars: chr [1:2] "speed" "dist"
# .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# ..$ dots:List of 2
# .. ..$ : language ~speed
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# .. ..$ : language ~dist
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# ..$ args:List of 1
# .. ..$ .by_group: logi FALSE
# ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

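Clearly, adding an ordering via arrange actually modifies the structure of the R object. Because remote tables cannot be intrinsically ordered (or grouped), the ordering and grouping information has to be stored locally and is only transmitted when the final query is built.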

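The workaround therefore uses three tricks. First, it uses compute() to generate a temporary table. Note that doing so does not reset the groups and ordering locally. Second, it uses Simon's trick to extract the simple select query corresponding to this new table and overwrites the existing table structure with it, so that all grouping and ordering information is lost. Third, to preserve the groups, the function re-adds the original groups to this table.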
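If you want to try the three steps without a PostgreSQL server, something like the following might do. It is a sketch under assumptions: an in-memory SQLite database (via DBI and RSQLite) stands in for the remote source, unarrange_sketch is a hypothetical variant that uses dbplyr's remote_con() and remote_query() accessors instead of reaching into the object, and I have not checked it against every dbplyr version.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Assumption: in-memory SQLite stands in for the PostgreSQL source.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
cars_db <- copy_to(con, cars, name = "cars_db")

# Hypothetical variant of the workaround, written with dbplyr accessors.
unarrange_sketch <- function(remote_df) {
    existing_groups <- groups(remote_df)   # remember the grouping
    computed <- compute(remote_df)         # trick 1: temporary table
    fresh <- tbl(remote_con(computed),     # trick 2: rebuild from the plain SELECT
                 sql(as.character(remote_query(computed))))
    group_by(fresh, !!!existing_groups)    # trick 3: restore the grouping
}

result <- cars_db %>%
    group_by(speed) %>%
    arrange(speed, dist) %>%
    unarrange_sketch()

# Capture what we want to inspect before closing the connection.
result_groups <- group_vars(result)
result_sql <- as.character(remote_query(result))
print(result_groups)
cat(result_sql, "\n")

DBI::dbDisconnect(con)
```

The grouping should survive, while the query behind the result is a plain select on the temporary table.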

Why is this useful?

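The example in the OP only illustrates the problem. In practice it arises when a mutate depends on some (grouped) ordering of the table. Once the new column has been built, the old ordering is no longer needed, and in fact, because of the linked issue on GitHub, the old ordering sometimes gets in the way. An example:

tmp_cars_sdf <-
    copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = TRUE)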


cars_df <-
    cars %>%
    arrange(speed, dist) %>%
    group_by(speed) %>%
    mutate(diff_dist_up = dist - lag(dist)) %>%
    arrange(speed, desc(dist)) %>%
    mutate(diff_dist_down = dist - lag(dist)) %>%
    ungroup() %>%
    arrange(speed, dist) %>%
    data.frame()

So that:

head(cars_df)
# speed dist diff_dist_up diff_dist_down
# 1     4    2           NA             -8
# 2     4   10            8             NA
# 3     7    4           NA            -18
# 4     7   22           18             NA
# 5     8   16           NA             NA
# 6     9   10           NA             NA

With the new function, we can replicate this remotely:

cars_df_2 <-
    tmp_cars_sdf %>%
    arrange(speed, dist) %>%
    group_by(speed) %>%
    mutate(diff_dist_up = dist - lag(dist)) %>%
    # unfortunately the next line is needed
    # because of https://github.com/tidyverse/dbplyr/issues/345
    unarrange() %>%
    arrange(speed, desc(dist)) %>%
    mutate(diff_dist_down = dist - lag(dist)) %>%
    ungroup() %>%
    unarrange() %>%
    collect() %>%
    arrange(speed, dist) %>%
    data.frame()

And checking, we see:

identical(cars_df, cars_df_2)
# [1] TRUE

Possible problems with this fix

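The first problem is that compute() has to be called, which uses up resources. The second is that it ought to be possible to modify the structure of the R object that encodes the remote table directly, but I don't know how the query is built from this structure, so I was not able to do that.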
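A different technique that may avoid the compute() call altogether is dbplyr's window_order(), which sets the ordering used by window functions such as lag() without calling arrange(). This is not the workaround described above, just a sketch of the alternative; it assumes a dbplyr version that provides window_order() and accepts desc() inside it, it uses a simulated PostgreSQL connection rather than a real one, and cars_lazy is a made-up name.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Simulated PostgreSQL table, so the SQL can be inspected without a database.
cars_lazy <- lazy_frame(speed = 4, dist = 2, con = simulate_postgres())

# window_order() only affects window functions such as lag();
# no arrange() is involved when building the two columns.
window_sql <- cars_lazy %>%
    group_by(speed) %>%
    window_order(dist) %>%
    mutate(diff_dist_up = dist - lag(dist)) %>%
    window_order(desc(dist)) %>%
    mutate(diff_dist_down = dist - lag(dist)) %>%
    ungroup() %>%
    sql_render() %>%
    as.character()
cat(window_sql, "\n")
```

Whether this gives the same result as cars_df on a real PostgreSQL table would still need to be checked with collect() and identical(), as in the comparison earlier in this answer.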