How to query a large number of IDs in batches in R
I have the data frame shown below in R.
ID Amount Date
IK-1 100 2020-01-01
IK-2 110 2020-01-02
IK-3 120 2020-01-03
IK-4 109 2020-01-03
IK-5 104 2020-01-03
I am using the ID column to fetch some details from MySQL with the following code.
library(RMySQL)
conn<- connection
query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
where c.ID IN (", paste(shQuote(dataframe$ID, type = "sh"),
collapse = ', '),")
and e.Parameters in
('Section1',
'Section2','Section3',
'Section4');")
res1 <- dbGetQuery(conn,query)
res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"
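The code above works fine when I pass about 1,000 IDs, but when I pass 10,000 or more IDs at once, R terminates with an error.
How can I create a loop and pass the IDs in batches to get the final output for all 10,000 IDs?
Error message: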
Warning message:
In dbFetch(rs, n = n, ...) : error while fetching rows
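The link from @A.Suliman suggests that this is most likely caused by the large number of values in your IN clause. Here are some solutions to try:
Batching
I like to use the modulus for batching. This assumes that the ID values you want to batch on are numeric: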
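library(dplyr) # for filter()
num_batches <- 100
output_list <- vector("list", num_batches)
for (i in seq_len(num_batches)) {
# rows whose numeric ID leaves remainder i - 1 form batch i
this_subset <- filter(dataframe, ID %% num_batches == (i - 1))
# subsequent processing using this_subset goes here;
# results_from_subsetting is a placeholder for its result
output_list[[i]] <- results_from_subsetting
}
output <- data.table::rbindlist(output_list)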
In your case, the IDs appear to take the form XX-123 (two characters, a hyphen, then some digits). You can convert them to numbers with: just_number_part = as.numeric(substr(ID, 4, nchar(ID)))
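As a minimal sketch of the modulus idea applied to string IDs like the ones in the question: the vector ids and the batch count of 3 below are made up for illustration, and the prefix is assumed to be exactly three characters long ("IK-").

```r
# Hypothetical IDs in the same "IK-<number>" shape as the question's data
ids <- paste0("IK-", 1:10)

# Drop the 3-character prefix and convert the remainder to a number
id_number <- as.numeric(substr(ids, 4, nchar(ids)))

num_batches <- 3

# Remainder 0..(num_batches - 1) decides which batch an ID belongs to
batch_index <- id_number %% num_batches

# split() groups the original ID strings by that remainder
batches <- split(ids, batch_index)

# batches[["0"]] holds "IK-3" "IK-6" "IK-9"
print(batches)
```

Each element of batches can then be passed to the query in place of the full ID vector. The batches are only roughly equal in size, which is usually good enough for keeping each IN clause small.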
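Writing to a temporary table
If you write dataframe from R to SQL, you no longer need such a large IN clause and can use a join instead. The dbplyr package provides a copy_to function that can be used to write a temporary table to the database.
This looks like: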
library(RMySQL)
library(dplyr) # the copy_to() generic is exported by dplyr
library(dbplyr) # database backend used by copy_to()
conn <- connection
# copy the local table to MySQL (copy_to() creates a temporary table by default)
copy_to(conn, dataframe, name = "my_table_name")
# the inner join on my_table_name replaces the IN clause
query <- "SELECT c.ID, e.Parameters, d.status
FROM Table1 c
INNER JOIN my_table_name a ON a.ID = c.ID
LEFT OUTER JOIN Table2 d ON d.seq_id = c.ID
LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE e.Parameters IN
('Section1',
'Section2','Section3',
'Section4');"
res1 <- dbGetQuery(conn, query)
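For reference, I recommend the tidyverse documentation. You might also find this question helpful for debugging when writing with copy_to.
Increasing the timeout
When the IN clause contains many values, the query can run more slowly, because the IN clause is essentially translated into a series of OR conditions.
According to this link, you can change MySQL's timeout options as follows:
- Edit your my.cnf (the MySQL configuration file)
- Add the timeout configuration and adjust it to suit your server.
wait_timeout = 28800
interactive_timeout = 28800
- Restart MySQL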
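Pass the data frame of IDs into a temporary table before running your SQL query, then inner join against it on the IDs you are using. This avoids the loop. All you have to do is call dbWriteTable with the argument temporary = TRUE.
For example: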
library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), user = 'user',
password = 'password', dbname = 'database_name', host = 'host')
# here we write the table into the DB and declare it as temporary
dbWriteTable(conn = con, name = "id_frame", value = dataframe, temporary = TRUE)
# the connection object is called con, so it must be passed as conn = con
res1 <- dbGetQuery(conn = con, "SELECT c.ID, e.Parameters, d.status
FROM Table1 c
LEFT OUTER JOIN Table2 d ON d.seq_id = c.ID
LEFT OUTER JOIN Table3 e ON e.role_id = d.role
INNER JOIN id_frame idf ON idf.ID = c.ID
AND e.Parameters IN
('Section1',
'Section2','Section3',
'Section4');")
This should improve the performance of your code, and you will no longer need to loop in R over the where clause. Let me know if it does not work.
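To see the temporary-table join end to end without a MySQL server, here is a small sketch that uses an in-memory SQLite database as a stand-in. The RSQLite backend, the contents of Table1, and the value column are assumptions made purely for illustration; against MySQL only the dbConnect() call would differ, provided your driver's dbWriteTable accepts temporary = TRUE.

```r
library(DBI)

# In-memory SQLite stands in for MySQL here (assumption, so the sketch is self-contained)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical server-side table with more rows than we want to fetch
dbWriteTable(con, "Table1", data.frame(ID = paste0("IK-", 1:10), value = 1:10))

# The IDs of interest live in a local data frame
dataframe <- data.frame(ID = c("IK-2", "IK-5", "IK-9"))

# Upload the IDs as a temporary table, then join instead of using IN (...)
dbWriteTable(con, "id_frame", dataframe, temporary = TRUE)
res <- dbGetQuery(con, "
  SELECT c.ID, c.value
  FROM Table1 c
  INNER JOIN id_frame idf ON idf.ID = c.ID
  ORDER BY c.value
")

dbDisconnect(con)
print(res)
```

The query text no longer grows with the number of IDs, so the same statement works for 5 IDs or 50,000.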
# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)
# create the initial dataframe
df <- tribble(
~ID, ~Amount, ~Date
, "IK-1" , 100 , "2020-01-01" # dates are quoted; unquoted 2020-01-01 is evaluated as arithmetic
, "IK-2" , 110 , "2020-01-02"
, "IK-3" , 120 , "2020-01-03"
, "IK-4" , 109 , "2020-01-03"
, "IK-5" , 104 , "2020-01-03"
)
# first helper function
createIDBatchVector <- function(x, batchSize){
paste0(
"'"
, sapply(
split(x, ceiling(seq_along(x) / batchSize))
, paste
, collapse = "','"
)
, "'"
)
}
# second helper function
createQueries <- function(IDbatches){
paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,")
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}
# ------------------------------------------------------------------
# and now the actual script
# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)
# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'"
# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)
cat(queries) # use cat to show what they look like
# it looks like this:
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
# LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
# LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2')
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
# LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
# LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4')
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
# LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
# LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5')
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
# and now the loop
df_final <- data.frame() # initialize a dataframe
conn <- connection # open a connection
for (query in queries){ # iterate over the queries
df_final <- rbind(df_final, dbGetQuery(conn,query))
}
dbDisconnect(conn) # close the connection
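The same loop can be wrapped in a small reusable function. This is only a sketch: run_in_batches, fetch_batch, and fake_fetch are hypothetical names, and fake_fetch replaces the database call so that the example runs without a MySQL connection. In real use fetch_batch would build the query for its IDs and call dbGetQuery().

```r
# Split ids into chunks of at most batch_size and call fetch_batch() on each chunk.
# fetch_batch is a hypothetical function(ids) that returns a data frame.
run_in_batches <- function(ids, batch_size, fetch_batch) {
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  results <- lapply(batches, fetch_batch)
  out <- do.call(rbind, results) # bind all batch results once, at the end
  rownames(out) <- NULL
  out
}

# Stand-in for the database call so the sketch runs without MySQL
fake_fetch <- function(ids) {
  data.frame(ID = ids, n_chars = nchar(ids), stringsAsFactors = FALSE)
}

ids <- paste0("IK-", 1:5)
df_final <- run_in_batches(ids, batch_size = 2, fetch_batch = fake_fetch)
print(df_final)
```

Collecting the results in a list and binding them once avoids re-copying df_final on every iteration, which matters more as the number of batches grows.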
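Maybe just try...
According to the comments above, there may be a size limit on the MySQL IN (...) condition.
Maybe you can work around it by splitting the whole list of dataframe$ID values into sublists and rewriting your query with a condition like this:
WHERE c.ID IN sublist#1
OR c.ID IN sublist#2
OR c.ID IN sublist#3
...
instead of a single c.ID IN list?
Supposing we create sublists with a maximum length of 1000, this gives:
sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))
You can then build a string like "OR c.ID IN (...) OR c.ID IN (...) OR c.ID IN (...) ..."
Inserted into your code, this gives: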
library(RMySQL)
conn<- connection
sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))
query <- paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
where (1 = 0 ", # always-false seed, so every sublist can be prefixed with OR
paste(lapply(sublists,
FUN = function(x){
paste0("OR c.ID IN (", paste(shQuote(x, type = "sh"), collapse = ', '), ")")
}),
collapse = "\n"), ")
and e.Parameters in
('Section1',
'Section2','Section3',
'Section4');")
cat(query) # print the generated SQL without overwriting query
res1 <- dbGetQuery(conn,query)
res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"
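If you want to inspect the OR-chained condition on its own, a helper along these lines may be easier to test. build_in_clause is a hypothetical name, and the sketch assumes the IDs contain no quote or backslash characters; for untrusted values a proper quoting function such as DBI::dbQuoteString() would be the safer choice.

```r
# Hypothetical helper: build "(col IN (...) OR col IN (...))" from a vector of IDs,
# with at most chunk_size values per IN list.
build_in_clause <- function(column, ids, chunk_size = 1000) {
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  parts <- vapply(
    chunks,
    function(x) paste0(column, " IN (", paste0("'", x, "'", collapse = ", "), ")"),
    character(1)
  )
  paste0("(", paste(parts, collapse = "\nOR "), ")")
}

clause <- build_in_clause("c.ID", paste0("IK-", 1:5), chunk_size = 2)
cat(clause, "\n")
```

The returned string is already wrapped in parentheses, so it can be pasted directly after WHERE and combined with the e.Parameters filter using AND.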