R DBI bind variable performance

Question

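I'm running into a performance problem when using bind variables with the DBI package. My original use case is a Postgres database, but for reproducibility I use an in-memory SQLite database below, which shows exactly the same problem: when I select data from a table (in Postgres the column is indexed), the parameterized version takes many times longer to select a given number of rows than SQL with the IDs pasted into an IN clause: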

library(DBI)
library(tictoc)

sample.data <- data.frame(
  id = 1:100000,
  value = rnorm(100000)
)

sqlite <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(
  sqlite, "sample_data",
  sample.data,
  overwrite = TRUE
)

tic("Load by bind")
ids <- 50000:50100
res <- dbSendQuery(sqlite, "SELECT * FROM sample_data WHERE id = $1")
dbBind(res, list(ids))
result <- dbFetch(res)
dbClearResult(res)
toc()

# Load by bind: 0.81 sec elapsed

tic("Load by paste")
ids <- 50000:50100
result2 <- dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", paste(ids, collapse = ","), ")"))
toc()

# Load by paste: 0.04 sec elapsed

It seems I must be making some obvious mistake, since the prepared query should be faster (and I do see that with Python/SQLAlchemy on the same Postgres example).

Answer 1

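Your first query, ... id = $1, is executed 101 times; your second query, ... id in (..), is executed once. If you audited this on the DBMS side (not demonstrated here), you would see 101 individual queries.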
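A minimal sketch of that difference on a tiny table, assuming the DBI and RSQLite packages are installed. The table contents and the object names con, by_bind and by_in are made up for illustration; the point is only that a single placeholder bound to a vector yields one result per element, while the IN form is a single statement.

```r
library(DBI)

# Illustrative in-memory table (not the 100000-row table from the question).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sample_data", data.frame(id = 1:10, value = (1:10) * 1.5))

ids <- 3:5

# One placeholder bound to a vector: the statement is run for each element
# of ids and the per-run results are combined into one data frame.
by_bind <- dbGetQuery(con, "SELECT * FROM sample_data WHERE id = ?",
                      params = list(ids))

# One statement, run once, with all ids inside IN (...).
by_in <- dbGetQuery(con, paste0(
  "SELECT * FROM sample_data WHERE id IN (", paste(ids, collapse = ","), ")"
))

# Both approaches return the same rows; only the number of executions differs.
nrow(by_bind)
nrow(by_in)

dbDisconnect(con)
```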

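Up front, a common mistake is to simply modify the statement to use an IN (?) clause,

dbGetQuery(pgcon, "SELECT * FROM sample_data WHERE id in (?)", params = list(ids))

but this also executes the query 101 times, with the same performance problem as result1.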

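To use parameter binding with the more efficient IN (..) clause, you need to supply as many question marks (or dollar-numbers) as there are values.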
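One way to generate that placeholder list is sketched below in base R. The helper name make_placeholders and its style argument are hypothetical, not part of DBI; which placeholder style a given driver accepts is something to check for that driver.

```r
# Hypothetical helper: build a comma-separated list of n placeholders,
# either "?" marks or dollar-numbers ($1, $2, ...).
make_placeholders <- function(n, style = c("qmark", "dollar")) {
  style <- match.arg(style)
  marks <- if (style == "qmark") rep("?", n) else paste0("$", seq_len(n))
  paste(marks, collapse = ",")
}

ids <- 50000:50002
sql <- paste0(
  "SELECT * FROM sample_data WHERE id IN (",
  make_placeholders(length(ids)), ")"
)

sql
# [1] "SELECT * FROM sample_data WHERE id IN (?,?,?)"
make_placeholders(3, "dollar")
# [1] "$1,$2,$3"
```

The resulting string would then be passed to dbGetQuery together with params = as.list(ids), so that each placeholder receives exactly one value.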

# Comma-separated literal ids, and one "?" placeholder per id.
idcommas <- paste(ids, collapse = ",")
qmarks <- paste(rep("?", length(ids)), collapse = ",")
bench::mark(
  result1 = dbGetQuery(sqlite, "SELECT * FROM sample_data WHERE id = $1", params = list(ids)),
  result2 = dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", idcommas, ")")),
  result3 = dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", qmarks, ")"),
                       params = as.list(ids)),
  min_iterations = 50
)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result             memory                  time          gc               
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>             <list>                  <list>        <list>           
# 1 result1    280.97ms  347.4ms      2.86    20.6KB        0    50     0      17.5s <df[,2] [101 x 2]> <Rprofmem[,3] [14 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 2 result2      7.31ms   8.21ms    115.      15.6KB        0    58     0    502.2ms <df[,2] [101 x 2]> <Rprofmem[,3] [12 x 3]> <bch:tm [58]> <tibble [58 x 3]>
# 3 result3      7.57ms   8.93ms    113.      28.4KB        0    57     0      506ms <df[,2] [101 x 2]> <Rprofmem[,3] [28 x 3]> <bch:tm [57]> <tibble [57 x 3]>

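If you're curious, it behaves the same way on a postgres instance (though every variant is noticeably slower there), although I changed your $1 to ?: sqlite accepts both, while odbc only supports qmarks: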

pgcon <- dbConnect(odbc::odbc(), ...) # local docker postgres instance
bench::mark(...)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result             memory                  time          gc               
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>             <list>                  <list>        <list>           
# 1 result1     967.4ms    1.05s     0.933    20.6KB        0    50     0     53.57s <df[,2] [101 x 2]> <Rprofmem[,3] [14 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 2 result2        57ms   67.7ms    14.4      18.1KB        0    50     0      3.47s <df[,2] [101 x 2]> <Rprofmem[,3] [13 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 3 result3      56.9ms  65.17ms    14.3      21.4KB        0    50     0       3.5s <df[,2] [101 x 2]> <Rprofmem[,3] [15 x 3]> <bch:tm [50]> <tibble [50 x 3]>

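I also tested on odbc/sql-server, with very similar results.

result2 and result3 are usually very close on all three DBMSes; in fact, depending on the sampling run, either one can come out ahead, so I'd call their performance comparison a wash. So what is the motivation for using binding? In many cases this is mostly an academic discussion: most of the time you are doing nothing wrong by not using it (and using your paste(ids, collapse=",") approach instead).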

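But:

Inadvertent "SQL injection". Technically, SQL injection has to be malicious to be labeled as such, but I informally use the term for the "oops" moments in my own SQL queries when the data had embedded quotes and pasting it into a static query string broke the quoting. Fortunately for me, all it did was break parsing of the query; I did not delete this year's student records.

A common mistake is to try to use sQuote to escape embedded quotes. Long story short: no, SQL does it differently. Many SQL users don't know that to escape an embedded single quote, you must double it: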
```
sQuote("he's Irish", q = FALSE)
# [1] "'he's Irish'"                      # WRONG

DBI::dbQuoteString(sqlite, "he's Irish")
# <SQL> 'he''s Irish'                     # RIGHT for both sqlite and pgcon
```
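To make the doubling rule concrete, here is a base-R sketch of what such quoting does to a string. The function name sql_quote_sketch is hypothetical and exists only for illustration; in real code, DBI::dbQuoteString (shown above) or bound parameters are the safer choice, since drivers may have additional rules.

```r
# Illustration only: double every embedded single quote, then wrap the
# whole string in single quotes.
sql_quote_sketch <- function(x) {
  paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
}

sql_quote_sketch("he's Irish")
# [1] "'he''s Irish'"
sql_quote_sketch("no quotes here")
# [1] "'no quotes here'"
```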
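Query optimization. Most (all? I'm not sure) DBMSes do some form of query optimization, trying to take advantage of indices and/or similar measures. To be good at it, this optimization is done once per query and then cached. However, if you change even one letter of the query, you risk a cache miss (I'm not saying "always", since I haven't audited the caching code ... but the premise is clear, I think). This means that changing a query from select * from mytable where a=1 to ... a=2 does not get a cache hit, and the query is optimized (again).

Contrast that with select * from mytable where a=? with parameter binding, where you benefit from the cache.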
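As a sketch of what keeping the statement text constant looks like in DBI, the example below sends the query once and rebinds only the parameter value. It assumes the DBI and RSQLite packages are installed; the table mytable and its contents are made up, and whether a particular DBMS actually reuses a cached plan here is not something this code demonstrates.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mytable",
             data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE))

# Send the statement once; its text never changes afterwards.
res <- dbSendQuery(con, "SELECT * FROM mytable WHERE a = ?")

dbBind(res, list(1L))   # first parameter value
row1 <- dbFetch(res)

dbBind(res, list(2L))   # same statement, new parameter value
row2 <- dbFetch(res)

dbClearResult(res)
dbDisconnect(con)
```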

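Note that if the length of your ids list changes, the query may be re-optimized (going from id in (?,?) to id in (?,?,?)); whether that really is a cache miss, I don't know, again not having audited the DBMS code.

As an aside: the "prepared statements" you mention fit very well with this query optimization, but the performance penalty you are experiencing has more to do with running the same query 101 times than with any cache hits/misses.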
