How to read a large table (>100 columns (variables) and 100,000 observations) from SQL Server into R using the odbc package
I am getting an error when reading a large table from SQL Server into R.
Here is my connection code:
library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(),
.connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')
Here is the schema of my table, which has 149 variables:
data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")
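I get an error from this code, possibly because of the table.
I want to reduce the size of the table (the number of observations) by filtering on the "VisitDateTime" variable.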
data2 <- dbGetQuery(con, "SELECT cast(VisitDateTime as DATETIME) as VisitDateTime FROM [eCW].[Visits] WHERE VisitDateTime>='2019-07-01 00:00:00' AND VisitDateTime<='2020-06-30 12:00:00'")
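This code selects only the "VisitDateTime" variable, but I want to get all 149 variables from the table.
I am hoping for some efficient code. I would greatly appreciate your help with this. Thank you.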
According to your schema, you have many variable-length varchar columns with a length of 255 characters. As a similar error post suggests, you cannot rely on arbitrary column order with SELECT * but must explicitly reference each column and place variable-length columns toward the end of the SELECT clause. In fact, in application code running SQL you should generally avoid SELECT * FROM. See the multiple answers to Why is SELECT * considered harmful?
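Fortunately, from your schema output using INFORMATION_SCHEMA.COLUMNS, you can dynamically build such a large named column list for SELECT. First, adjust your schema query and run it into an R data frame, with a calculated column that orders the types from smallest to largest along with their precisions/lengths.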
schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
                    , sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
                    , sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
               FROM
                  (SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
                        , CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
                        , NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
                        , CASE DATA_TYPE
                               WHEN 'tinyint' THEN 1
                               WHEN 'smallint' THEN 2
                               WHEN 'int' THEN 3
                               WHEN 'bigint' THEN 4
                               WHEN 'date' THEN 5
                               WHEN 'datetime' THEN 6
                               WHEN 'datetime2' THEN 7
                               WHEN 'decimal' THEN 8
                               WHEN 'varchar' THEN 9
                               WHEN 'nvarchar' THEN 10
                          END AS SELECT_TYPE_ORDER
                   FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_SCHEMA = 'eCW'  -- INFORMATION_SCHEMA.COLUMNS EXPOSES TABLE_SCHEMA, NOT SCHEMA_NAME
                     AND TABLE_NAME = 'Visits'
                  ) sub
               ORDER BY sub.SELECT_TYPE_ORDER
                      , sub.NUMERIC_PRECISION
                      , sub.NUMERIC_PRECISION_RADIX
                      , sub.NUMERIC_SCALE
                      , sub.CHARACTER_MAXIMUM_LENGTH
                      , sub.CHARACTER_OCTET_LENGTH"
visits_schema_df <- dbGetQuery(con, schema_sql)
# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")
# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))
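To combine the explicit column list with the date filter from the question, something like the following sketch may work. The helper name build_select and the mock schema rows are hypothetical (they only stand in for visits_schema_df so the snippet runs without a database), and the column types shown are guesses rather than your actual schema.

```r
# Hypothetical helper (not part of DBI or odbc): builds a SELECT statement
# with explicit, bracket-quoted columns and an optional WHERE clause.
build_select <- function(schema_df, table, where = NULL) {
  cols <- paste0("[", schema_df$COLUMN_NAME, "]", collapse = ", ")
  sql <- paste("SELECT", cols, "FROM", table)
  if (!is.null(where)) sql <- paste(sql, "WHERE", where)
  sql
}

# Mock rows standing in for visits_schema_df (illustrative only;
# the data types here are assumptions)
mock_schema <- data.frame(
  COLUMN_NAME = c("DW_Id", "VisitDateTime", "LastHIVTestResult"),
  DATA_TYPE   = c("int", "datetime2", "varchar"),
  stringsAsFactors = FALSE
)

query <- build_select(
  mock_schema, "[eCW].[Visits]",
  where = paste("VisitDateTime >= '2019-07-01 00:00:00'",
                "AND VisitDateTime <= '2020-06-30 12:00:00'")
)

# With the real schema data frame and connection:
# data <- dbGetQuery(con, build_select(visits_schema_df, "[eCW].[Visits]", where = ...))
```

The column order in the generated SELECT follows the row order of the schema data frame, so the ORDER BY in the schema query is what decides which columns come last.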
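You may need to adjust the above if the same error arises. Be proactive and test by isolating problem columns, column types, and so on. Some suggestions include filtering out specific DATA_TYPE or COLUMN_NAME values, or rearranging the ORDER BY columns in the schema query.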
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND DATA_TYPE IN ('tinyint', 'smallint', 'int')       -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('varchar', 'nvarchar')          -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('decimal', 'datetime2')         -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT COLUMN_NAME IN ('LastHIVTestResult')          -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER                          -- ADJUST ORDERING
       , sub.NUMERIC_SCALE
       , sub.NUMERIC_PRECISION
       , sub.NUMERIC_PRECISION_RADIX
       , sub.CHARACTER_OCTET_LENGTH
       , sub.CHARACTER_MAXIMUM_LENGTH
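If editing the SQL for every test gets tedious, the same isolation can probably be done on the R side by subsetting the schema data frame. The sketch below uses a mock data frame in place of visits_schema_df; the column names Notes and Age, the TOP 10 row limit, and the idea that a varchar(max) column exists are all assumptions for illustration. One thing that is worth checking in your real schema: INFORMATION_SCHEMA.COLUMNS reports CHARACTER_MAXIMUM_LENGTH as -1 for varchar(max)/nvarchar(max), so an ascending sort places those columns before varchar(255) rather than at the end.

```r
# Mock rows standing in for visits_schema_df (illustrative only)
mock_schema <- data.frame(
  COLUMN_NAME = c("DW_Id", "VisitDateTime", "Notes", "LastHIVTestResult", "Age"),
  DATA_TYPE   = c("int", "datetime2", "varchar", "varchar", "tinyint"),
  CHARACTER_MAXIMUM_LENGTH = c(NA, NA, -1, 255, NA),
  stringsAsFactors = FALSE
)

# One subset of the schema per test, mirroring the WHERE variants above
tests <- list(
  ints_only  = subset(mock_schema, DATA_TYPE %in% c("tinyint", "smallint", "int")),
  no_strings = subset(mock_schema, !DATA_TYPE %in% c("varchar", "nvarchar")),
  no_large   = subset(mock_schema, !COLUMN_NAME %in% c("LastHIVTestResult"))
)

# Build a small test query per subset; TOP 10 keeps each test quick
queries <- vapply(tests, function(df) {
  paste0("SELECT TOP 10 ",
         paste0("[", df$COLUMN_NAME, "]", collapse = ", "),
         " FROM [eCW].[Visits]")
}, character(1))

# Reorder so that (max)-length columns (length -1) come last,
# and other character columns are sorted by length after non-character ones
len    <- mock_schema$CHARACTER_MAXIMUM_LENGTH
is_max <- !is.na(len) & len == -1
ordered_schema <- mock_schema[order(is_max, len, na.last = FALSE), ]

# Against the real connection, each test could be run and its error captured:
# results <- lapply(queries, function(q)
#   tryCatch(dbGetQuery(con, q), error = function(e) conditionMessage(e)))
```

Whichever test queries succeed narrow down the column type or name that triggers the error.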
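Still another solution is to stitch R data frames together by type (adjusting the schema query for each) using a chain merge on the primary key (assumed here to be DW_Id):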
final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
list(data_int_columns, # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
data_num_columns, # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT
data_dt_columns, # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
data_char_columns) # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
)
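As a rough illustration of the chain merge, the sketch below uses small mock data frames in place of the four query results, plus a hypothetical helper cols_for that picks the columns of one type group and always includes the key. The column names Age and Weight and all values are made up; only DW_Id, VisitDateTime and LastHIVTestResult come from the thread.

```r
# Hypothetical helper: columns of the given types, always led by the key
cols_for <- function(schema_df, types, key = "DW_Id") {
  union(key, schema_df$COLUMN_NAME[schema_df$DATA_TYPE %in% types])
}

# Mock schema rows standing in for visits_schema_df (illustrative only)
mock_schema <- data.frame(
  COLUMN_NAME = c("DW_Id", "Age", "Weight", "VisitDateTime", "LastHIVTestResult"),
  DATA_TYPE   = c("int", "tinyint", "decimal", "datetime2", "varchar"),
  stringsAsFactors = FALSE
)
char_cols <- cols_for(mock_schema, c("varchar", "nvarchar"))

# Mock results standing in for the four separate dbGetQuery() calls
data_int_columns  <- data.frame(DW_Id = 1:3, Age = c(30L, 41L, 25L))
data_num_columns  <- data.frame(DW_Id = 1:3, Weight = c(70.5, 82.1, 64.0))
data_dt_columns   <- data.frame(
  DW_Id = 1:3,
  VisitDateTime = as.POSIXct(c("2019-07-01", "2019-12-15", "2020-06-30"), tz = "UTC")
)
data_char_columns <- data.frame(DW_Id = 1:3,
                                LastHIVTestResult = c("x", "y", "z"),
                                stringsAsFactors = FALSE)

# Chain merge on the key: each step joins the next data frame to the result so far
final_data <- Reduce(function(x, y) merge(x, y, by = "DW_Id"),
                     list(data_int_columns, data_num_columns,
                          data_dt_columns, data_char_columns))
```

Note that merge() performs an inner join by default, so a DW_Id missing from any one of the pieces drops out of final_data; applying the same WHERE filter to every query should keep the pieces aligned.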