How to query a large number of IDs in batches in R

I have the data frame shown below in R.

ID       Amount     Date
IK-1     100        2020-01-01
IK-2     110        2020-01-02
IK-3     120        2020-01-03
IK-4     109        2020-01-03
IK-5     104        2020-01-03

I am using the IDs to fetch some details from MySQL with the following code.

library(RMySQL)

conn<- connection

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where c.ID IN (", paste(shQuote(dataframe$ID, type = "sh"),
                                      collapse = ', '),") 
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"

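The code above works fine when I pass about 1,000 IDs, but when I pass 10,000 or more IDs at once, R terminates with an error.

How can I create a loop and pass the IDs in batches to get the final output for all 10,000 IDs?

Error message: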

Warning message:
In dbFetch(rs, n = n, ...) : error while fetching rows

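The link from @A.Suliman suggests this is most likely because of the large number of values in your IN clause. Here are some solutions to try:

Batching

I like to use the modulus for batching. This assumes that the ID values you want to batch on are numeric: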

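library(dplyr)

num_batches <- 100
output_list <- vector("list", num_batches)

for (i in seq_len(num_batches)) {
    # rows whose (numeric) ID falls into batch i
    this_subset <- filter(dataframe, ID %% num_batches == (i - 1))

    # subsequent processing using this_subset

    # use [[ ]] so the whole result is stored as one list element
    output_list[[i]] <- results_from_subsetting
}
output <- data.table::rbindlist(output_list)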

In your case, the IDs appear to have the form XX-123 (two characters, a hyphen, then some digits). You can convert them to numbers with: just_number_part = as.numeric(substr(ID, 4, nchar(ID))).
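As a minimal sketch of the conversion and modulus batching described above, using base R only: the helper name assign_batches and the sample IDs are made up for illustration, and it assumes every ID has a three-character prefix such as "IK-" before the digits.

```r
# Hypothetical helper: assign each ID to a batch using the modulus of its numeric part.
assign_batches <- function(ids, num_batches) {
  # Drop the 3-character prefix (e.g. "IK-") and convert the rest to a number
  number_part <- as.numeric(substr(ids, 4, nchar(ids)))
  # Batch index in 1..num_batches
  (number_part %% num_batches) + 1
}

ids <- c("IK-1", "IK-2", "IK-3", "IK-4", "IK-5")
batch_index <- assign_batches(ids, num_batches = 2)

# split() groups the IDs by batch index; each element can then be queried separately
batches <- split(ids, batch_index)
print(batches)
```

Each element of batches could then be pasted into its own IN clause, so that no single query carries the full list of IDs.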

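Writing to a temporary table

If you write your dataframe from R to SQL, you no longer need such a large IN clause and can use a join instead. The dbplyr package provides the database backend for dplyr's copy_to function, which can be used to write a temporary table to the database.

This would look something like: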

library(RMySQL)
library(dplyr)  # provides the copy_to generic
library(dbplyr) # provides the database backend for copy_to

conn<- connection

copy_to(conn, dataframe, name = "my_table_name") # copy local table to mysql

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
INNER JOIN my_table_name a ON a.ID = c.ID # replace IN-clause with inner join
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
WHERE e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

For reference, I recommend the tidyverse documentation. You might also find this question helpful for debugging when writing code that uses copy_to.

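Increasing the timeout

When there are many values in the IN clause, the query runs more slowly, because an IN clause is essentially translated into a series of OR conditions.

According to this link, you can change MySQL's timeout options as follows:

  • Edit your my.cnf (the MySQL configuration file)
  • Add the timeout settings and adjust them to suit your server.
    • wait_timeout = 28800
    • interactive_timeout = 28800
  • Restart MySQL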

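Pass the data frame of IDs into a temporary table before running your SQL query, then inner join on that table to select the IDs you need. This avoids the loop. All you have to do is call dbWriteTable with the argument temporary = TRUE.

For example: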

library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), user='user', 
password='password', dbname='database_name', host='host')
# write the data frame to the database as a temporary table
dbWriteTable(conn = con, value = dataframe, name = "id_frame", temporary = TRUE)
res1 <- dbGetQuery(conn = con, "SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
Inner join id_frame idf on idf.ID = c.ID 
and e.Parameters in
       ('Section1',
       'Section2','Section3',
       'Section4');")

This should improve the performance of your code, and you will no longer need to loop in R over a WHERE clause. Let me know if it does not work properly.
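Since the join only needs the ID column, the table that gets uploaded can be slimmed down first. The following is a rough sketch, not part of the answer above: the helper name make_id_frame and the sample data are hypothetical, and it assumes the IDs live in a column called ID.

```r
# Hypothetical helper: build a one-column table of distinct, non-missing IDs.
make_id_frame <- function(df, id_col = "ID") {
  ids <- unique(df[[id_col]])
  ids <- ids[!is.na(ids)]
  data.frame(ID = as.character(ids), stringsAsFactors = FALSE)
}

# Sample data with a duplicate and a missing ID (made up for illustration)
dataframe <- data.frame(
  ID = c("IK-1", "IK-2", "IK-2", NA, "IK-3"),
  Amount = c(100, 110, 110, 95, 120),
  stringsAsFactors = FALSE
)

id_frame <- make_id_frame(dataframe)
print(id_frame)

# id_frame could then be passed as `value` to the dbWriteTable call shown above,
# in place of the full data frame.
```

Removing duplicates before the upload also keeps the inner join from multiplying rows in the result.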

# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)

# create the initial dataframe
df <- tribble(
    ~ID,       ~Amount,     ~Date
    , "IK-1"    , 100       , "2020-01-01"
    , "IK-2"    , 110       , "2020-01-02"
    , "IK-3"    , 120       , "2020-01-03"
    , "IK-4"    , 109       , "2020-01-03"
    , "IK-5"    , 104       , "2020-01-03"
)

# first helper function
createIDBatchVector <- function(x, batchSize){
    paste0(
        "'"
        , sapply(
            split(x, ceiling(seq_along(x) / batchSize))
            , paste
            , collapse = "','"
        )
        , "'"
    )
}

# second helper function
createQueries <- function(IDbatches){
    paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
    LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
    LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,") 
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}

# ------------------------------------------------------------------

# and now the actual script

# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)

# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'" 

# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)

cat(queries) # use cat to show what they look like

# it looks like this:

# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');

# and now the loop
df_final <- data.frame() # initialize a dataframe

conn <- connection # open a connection
for (query in queries){ # iterate over the queries
    df_final <- rbind(df_final, dbGetQuery(conn,query))
}

# close the connection when done
dbDisconnect(conn)
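The loop above grows df_final with rbind on every iteration. As an alternative sketch, the per-batch results can be collected with lapply and bound once at the end. The names fetch_in_batches and fake_fetch are hypothetical; fake_fetch stands in for the database call so the example runs without MySQL.

```r
# Hypothetical helper: run one fetch per batch and stack the results.
# `fetch` is any function that takes a character vector of IDs and returns a data frame.
fetch_in_batches <- function(ids, batch_size, fetch) {
  if (length(ids) == 0) return(data.frame())
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  results <- lapply(batches, fetch)
  out <- do.call(rbind, results)  # bind all batches in one step
  rownames(out) <- NULL
  out
}

# Stand-in for the database call (an assumption for this sketch).
# With a real connection, this function could build the query for batch_ids
# and return dbGetQuery(conn, query).
fake_fetch <- function(batch_ids) {
  data.frame(ID = batch_ids, n_in_batch = length(batch_ids), stringsAsFactors = FALSE)
}

ids <- paste0("IK-", 1:5)
result <- fetch_in_batches(ids, batch_size = 2, fetch = fake_fetch)
print(result)
```

Passing the fetch step in as a function keeps the batching logic testable on its own, and the single rbind avoids copying a growing data frame on every iteration.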

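Maybe just something to try...

Based on the comments above, there may be a size limit on the MySQL IN (...) condition. Perhaps you can get around it by splitting the whole list of dataframe$ID values into sublists and rewriting your query with a condition like this: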

WHERE c.ID IN sublist#1
OR c.ID IN sublist#2
OR c.ID IN sublist#3
...

instead of a single c.ID IN list?

Assuming we create sublists with a maximum length of 1000, this gives:

sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))

Then you can build a string of conditions joined with OR, such as "c.ID IN (...) OR c.ID IN (...) OR c.ID IN (...) ..."

Plugged into your code, this gives:

library(RMySQL)
conn<- connection
sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))

query <- paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where (",
       paste(lapply(sublists,
                    FUN = function(x){
                      paste0("c.ID IN (",  paste(shQuote(x, type = "sh"), collapse = ', '), ")")
                    }),
             collapse = "\nOR "), ")
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

cat(query) # inspect the generated SQL

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"
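To check the generated condition without a database, the clause-building step can be pulled out into a small function. This is only a sketch: build_in_clause is a hypothetical name, and it assumes the IDs contain no single quotes, because shQuote escapes them shell-style rather than SQL-style.

```r
# Hypothetical helper: build "(col IN (...) OR col IN (...) ...)" from chunks of IDs.
build_in_clause <- function(ids, chunk_size, column = "c.ID") {
  sublists <- split(ids, ceiling(seq_along(ids) / chunk_size))
  parts <- vapply(
    sublists,
    function(x) {
      paste0(column, " IN (", paste(shQuote(x, type = "sh"), collapse = ", "), ")")
    },
    character(1)
  )
  # Wrap in parentheses so the OR chain combines safely with other AND conditions
  paste0("(", paste(parts, collapse = " OR "), ")")
}

clause <- build_in_clause(paste0("IK-", 1:5), chunk_size = 2)
cat(clause, "\n")
# (c.ID IN ('IK-1', 'IK-2') OR c.ID IN ('IK-3', 'IK-4') OR c.ID IN ('IK-5'))
```

The outer parentheses matter: without them, the trailing "and e.Parameters in (...)" would bind only to the last IN group.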