How do you remove ordering on a remote (PostgreSQL) table?
Calling dplyr::arrange() on a table in a remote source adds an 'Ordered by: ...' flag. Is there a follow-up function that removes this 'Ordered by:' flag from the remote table?
Consider the example data:
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
For this table:
glimpse(tmp_cars_sdf)
# Observations: ??
# Variables: 2
# Database: postgres 9.5.3
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26...
Consider a local data frame:
tmp_cars <-
cars
tmp_cars <-
tmp_cars %>%
arrange(speed, dist)
glimpse(tmp_cars)
# Observations: 50
# Variables: 2
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
But on the remote table:
tmp_cars <-
tmp_cars_sdf %>%
arrange(speed, dist)
glimpse(tmp_cars)
# Observations: ??
# Variables: 2
# Database: postgres 9.5.3
# Ordered by: speed, dist
# $ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13,...
# $ dist <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34, 34,...
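dbplyr tends to nest subqueries as you add commands. So as you add more commands, an earlier arrange may end up inside a subquery. This seems to be the underlying problem.
One option for removing these is to render the underlying SQL query and edit it directly. Perhaps something like the following: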
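unarrange = function(table, cols_prev_ordered_by){
  # connection behind the remote table
  db_connection = table$src$con
  # text of the ORDER BY clause as rendered, e.g. ORDER BY "speed", "dist"
  order_text = paste0("ORDER BY \"",
                      paste0(cols_prev_ordered_by, collapse = "\", \""),
                      "\"")
  query_text = table %>% sql_render() %>% as.character()
  # fixed = TRUE: match the clause literally, not as a regular expression
  new_query_text = gsub(order_text, "", query_text, fixed = TRUE)
  # wrap in sql() so the edited text is treated as a query, not a string literal
  return(tbl(db_connection, sql(new_query_text)))
}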
# example:
tmp_cars <-
  tmp_cars_sdf %>%
  arrange(speed, dist) %>%
  unarrange(c("speed", "dist"))
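There are surely more robust ways than gsub to identify and remove the ordering part of the query. If this matters, you may want to look at ?select_query, since it has an explicit order_by argument.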
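Before relying on a text substitution like the one above, it can help to look at the SQL that dbplyr actually renders. The following is a minimal sketch, not part of the original answer: it assumes dbplyr's lazy_frame() and simulate_postgres() helpers, which render SQL without a live connection, and cars_lazy is a hypothetical stand-in for the remote table.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the remote table: a simulated PostgreSQL backend,
# so no database connection is needed just to render SQL.
cars_lazy <- lazy_frame(speed = 4, dist = 2, con = simulate_postgres())

ordered_sql <- cars_lazy %>%
  arrange(speed, dist) %>%
  sql_render() %>%
  as.character()

# Print the query to see exactly how the ORDER BY clause is written
cat(ordered_sql, "\n")

# The literal text that the gsub-based unarrange() would search for
order_text <- paste0("ORDER BY \"",
                     paste0(c("speed", "dist"), collapse = "\", \""),
                     "\"")

# TRUE only if the rendered clause matches the pattern character for character
grepl(order_text, ordered_sql, fixed = TRUE)
```

If the final check is FALSE on your backend or dbplyr version (for example because identifiers are quoted differently), the substitution would silently do nothing, which is one reason the text-editing approach is fragile.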
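Inspired by Simon's answer and the comments on the OP, the following function is a workaround that removes all ordering (while keeping any new columns that were computed as a result of the ordering). It is probably not the most efficient or low-level/direct approach, a point I return to at the end of this answer, but I will leave it to the dbplyr team to address my issue if they see fit.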
unarrange <-
function(remote_df) {
existing_groups <- groups(remote_df)
remote_df <-
remote_df %>%
compute()
remote_df <-
tbl(remote_df$src$con,
sql_render(remote_df))
remote_df <-
group_by(remote_df, !!!existing_groups)
return(remote_df)
}
Why it works
Input data:
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
Consider
str(tmp_cars_sdf)
# ..$ con <truncated>
# ..$ disco <truncated>
# $ ops:List of 2
# ..$ x : 'ident' chr "tmp_cars_sdf"
# ..$ vars: chr [1:2] "speed" "dist"
# ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
versus
tmp_cars_sdf <-
tmp_cars_sdf %>%
arrange(speed, dist)
str(tmp_cars_sdf)
# $ ops:List of 4
# ..$ name: chr "arrange"
# ..$ x :List of 2
# .. ..$ x : 'ident' chr "tmp_cars_sdf"
# .. ..$ vars: chr [1:2] "speed" "dist"
# .. ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
# ..$ dots:List of 2
# .. ..$ : language ~speed
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# .. ..$ : language ~dist
# .. .. ..- attr(*, ".Environment")=<environment: 0x000000002556b260>
# ..$ args:List of 1
# .. ..$ .by_group: logi FALSE
# ..- attr(*, "class")= chr [1:3] "op_arrange" "op_single" "op"
# - attr(*, "class")= chr [1:5] "tbl_PostgreSQLConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
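Clearly, adding an ordering via arrange actually modifies the structure of the R object. Because remote tables cannot be intrinsically ordered (or grouped), the ordering and grouping information has to be stored locally and is only transmitted when the final query is built.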
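The workaround therefore uses three tricks. First, it uses compute() to generate a temporary table. Note that doing so does not reset the groups and ordering locally. Second, it uses Simon's trick to extract the simple select query corresponding to this new table, and overwrites the existing table structure so that all grouping and ordering information is lost. Third, to preserve the groups, the function re-adds the original groups to this table.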
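The last of these steps, capturing the groups and re-applying them to a fresh table reference, can be tried without a database. This is only an illustration under assumptions: it uses dbplyr's lazy_frame() and simulate_postgres(), which do not appear in the answer, and cars_lazy is a hypothetical stand-in for the remote table.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the remote table (no live connection required)
cars_lazy <- lazy_frame(speed = 4, dist = 2, con = simulate_postgres())

# A grouped, ordered lazy table, as it would look in the middle of a pipeline
grouped <- cars_lazy %>%
  group_by(speed) %>%
  arrange(speed, dist)

# The grouping is stored locally on the R object, so it can be captured ...
existing_groups <- groups(grouped)

# ... and re-applied to a fresh reference that carries no grouping or ordering
regrouped <- cars_lazy %>%
  group_by(!!!existing_groups)

# Grouping variables of the rebuilt table
group_vars(regrouped)
```

In unarrange() the fresh reference comes from tbl() on the rendered query of the computed table rather than from the original lazy table, but the capture-and-splice pattern with groups() and !!! is the same.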
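Why is this useful?
Although the example in the OP serves to illustrate the problem, the issue arises in practice because mutates depend on some (grouped) ordering of the table. Once the new column has been built, the old ordering is no longer needed and, because of the linked GitHub issue, it sometimes gets in the way. An example follows: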
tmp_cars_sdf <-
copy_to(con_psql, cars, name = "tmp_cars_sdf", overwrite = T)
cars_df <-
cars %>%
arrange(speed, dist) %>%
group_by(speed) %>%
mutate(diff_dist_up = dist - lag(dist)) %>%
arrange(speed, desc(dist)) %>%
mutate(diff_dist_down = dist - lag(dist)) %>%
ungroup() %>%
arrange(speed, dist) %>%
data.frame()
so that:
head(cars_df)
# speed dist diff_dist_up diff_dist_down
# 1 4 2 NA -8
# 2 4 10 8 NA
# 3 7 4 NA -18
# 4 7 22 18 NA
# 5 8 16 NA NA
# 6 9 10 NA NA
With the new function, we can replicate this remotely:
cars_df_2 <-
tmp_cars_sdf %>%
arrange(speed, dist) %>%
group_by(speed) %>%
mutate(diff_dist_up = dist - lag(dist)) %>%
# unfortunately the next line is needed
# because of https://github.com/tidyverse/dbplyr/issues/345
unarrange() %>%
arrange(speed, desc(dist)) %>%
mutate(diff_dist_down = dist - lag(dist)) %>%
ungroup() %>%
unarrange() %>%
collect() %>%
arrange(speed, dist) %>%
data.frame()
and checking, we see:
identical(cars_df, cars_df_2)
# [1] TRUE
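Possible problems with this fix
The first problem is the need to call compute(), which consumes resources. The second is that it should be possible to modify the structure of the R object that encodes the remote table directly, but I do not know how the query is built from this structure, so I was not able to do so.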
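For lag()-style mutates such as the ones in the example, one possible way to avoid the compute() cost is to pass the ordering to the window function itself rather than relying on a preceding arrange(). This is a sketch of an alternative, not the method used in this answer: it assumes the order_by argument of lag() is translated by dbplyr into the ORDER BY part of the window clause, and it renders SQL against a simulated backend (lazy_frame() with simulate_postgres(), with cars_lazy as a hypothetical stand-in) rather than a real PostgreSQL connection.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the remote table (no live connection required)
cars_lazy <- lazy_frame(speed = 4, dist = 2, con = simulate_postgres())

diffs_sql <- cars_lazy %>%
  group_by(speed) %>%
  mutate(
    # ordering is given per window function, so no arrange() is involved
    diff_dist_up   = dist - lag(dist, order_by = dist),
    diff_dist_down = dist - lag(dist, order_by = desc(dist))
  ) %>%
  ungroup() %>%
  sql_render() %>%
  as.character()

# Inspect the generated window clauses
cat(diffs_sql, "\n")
```

Because no arrange() is ever attached to the table, there is no 'Ordered by:' state to remove afterwards. Whether this reproduces the local result exactly has not been checked here, so compare against cars_df before relying on it.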