How to read a large table (>100 columns (variables) and 100,000 observations) from SQL Server into R using the odbc package

I get an error when reading a large table from SQL Server into R.

Here is my connection code:

library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(), 
     .connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')

Here is the schema of my table, which has 149 variables:

data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")

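I get an error from this code, probably because the table is so large.

I would like to reduce the table (the number of observations) by filtering on the "VisitDateTime" variable.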

data2 <- dbGetQuery(con, "SELECT cast(VisitDateTime as DATETIME) as VisitDateTime FROM [eCW].[Visits] WHERE VisitDateTime>='2019-07-01 00:00:00' AND VisitDateTime<='2020-06-30 12:00:00'")

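This code selects only the "VisitDateTime" variable, but I want to get all 149 variables from the table.

I am hoping for some efficient code. I would greatly appreciate your help with this. Thank you.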

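Based on your schema, you have many variable-length types, varchar, with a length of 255 characters. As multiple answers on a similar error post suggest, you cannot rely on arbitrary column order with SELECT * but must explicitly reference each column and place variable-length columns toward the end of the SELECT clause. In fact, application code that runs SQL should generally avoid SELECT * FROM. See "Why is SELECT * considered harmful?"

Fortunately, from the schema output you produced with INFORMATION_SCHEMA.COLUMNS, you can dynamically build such a long named column list for SELECT. First, adjust your schema query and run it into an R data frame, adding a calculated column that orders the types from smallest to largest, along with their precisions and lengths:
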
schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
                    , sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
                    , sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
               FROM 
                  (SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE 
                        , CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
                        , NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
                        , CASE DATA_TYPE
                                WHEN 'tinyint'   THEN 1
                                WHEN 'smallint'  THEN 2
                                WHEN 'int'       THEN 3
                                WHEN 'bigint'    THEN 4
                                WHEN 'date'      THEN 5
                                WHEN 'datetime'  THEN 6
                                WHEN 'datetime2' THEN 7
                                WHEN 'decimal'   THEN 8
                                WHEN 'varchar'   THEN 9
                                WHEN 'nvarchar'  THEN 10
                          END AS SELECT_TYPE_ORDER
                   FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_SCHEMA = 'eCW'
                     AND TABLE_NAME = 'Visits'
                  ) sub
               ORDER BY sub.SELECT_TYPE_ORDER
                      , sub.NUMERIC_PRECISION
                      , sub.NUMERIC_PRECISION_RADIX
                      , sub.NUMERIC_SCALE
                      , sub.CHARACTER_MAXIMUM_LENGTH
                      , sub.CHARACTER_OCTET_LENGTH"

visits_schema_df <- dbGetQuery(con, schema_sql)

# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")

# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))
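
The explicit column list can also be combined with the date filter from the question, so that all columns come back for only the rows of interest. The following is a minimal sketch, not a definitive implementation: `build_select` is a hypothetical helper, `example_columns` is a made-up stand-in for `visits_schema_df$COLUMN_NAME`, and passing the dates as character parameters assumes SQL Server converts them implicitly for the comparison.

```r
# Hypothetical helper (not part of odbc or DBI): build a SELECT with explicit,
# bracket-quoted columns and an optional WHERE clause.
build_select <- function(columns, schema, table, where = NULL) {
  column_list <- paste0("[", columns, "]", collapse = ", ")
  sql <- paste0("SELECT ", column_list, " FROM [", schema, "].[", table, "]")
  if (!is.null(where)) {
    sql <- paste(sql, "WHERE", where)
  }
  sql
}

# Made-up stand-in for visits_schema_df$COLUMN_NAME, already ordered so that
# variable-length columns come last.
example_columns <- c("DW_Id", "VisitDateTime", "LastHIVTestResult")

visits_sql <- build_select(
  example_columns, "eCW", "Visits",
  where = "VisitDateTime >= ? AND VisitDateTime <= ?"
)
print(visits_sql)

# With a live connection (not run here), the ? placeholders would be bound
# through the params argument of DBI::dbGetQuery():
# data2 <- dbGetQuery(con, visits_sql,
#                     params = list("2019-07-01 00:00:00", "2020-06-30 12:00:00"))
```

Filtering on the server means only the matching rows travel over the connection, which is usually cheaper than reading the whole table and subsetting in R.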

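If the same error occurs, the above may need adjusting. Test proactively by isolating the problem columns, column types, and so on. Some suggestions include filtering on DATA_TYPE or COLUMN_NAME, or rearranging the ORDER BY columns in the schema query.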

...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND DATA_TYPE IN ('tinyint', 'smallint', 'int')  -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('varchar', 'nvarchar')     -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('decimal', 'datetime2')    -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT COLUMN_NAME IN ('LastHIVTestResult')     -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER                         -- ADJUST ORDERING
       , sub.NUMERIC_SCALE                             
       , sub.NUMERIC_PRECISION
       , sub.NUMERIC_PRECISION_RADIX
       , sub.CHARACTER_OCTET_LENGTH
       , sub.CHARACTER_MAXIMUM_LENGTH
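
The type-by-type tests can also be driven from R rather than by editing the schema query by hand. The sketch below is one possible approach under stated assumptions: `find_failing_types` and `mock_run_query` are hypothetical names, the toy schema (including the `Notes` column) is made up, and the mock runner only simulates a driver error so the example runs without a database. A group that passes on its own may still fail when combined with other columns, so treat the result as a hint rather than a diagnosis.

```r
# Hypothetical helper: query a few rows for each data type separately and
# report which types raise an error. `run_query` is any function that takes a
# SQL string, e.g. function(sql) DBI::dbGetQuery(con, sql).
find_failing_types <- function(schema_df, run_query, from = "[eCW].[Visits]") {
  columns_by_type <- split(schema_df$COLUMN_NAME, schema_df$DATA_TYPE)
  failed <- vapply(columns_by_type, function(cols) {
    sql <- paste0("SELECT TOP 10 ", paste0("[", cols, "]", collapse = ", "),
                  " FROM ", from)
    tryCatch({
      run_query(sql)
      FALSE                       # the query for this type succeeded
    }, error = function(e) TRUE)  # the query for this type raised an error
  }, logical(1))
  names(failed)[failed]
}

# Made-up schema standing in for visits_schema_df.
toy_schema <- data.frame(
  COLUMN_NAME = c("DW_Id", "VisitDateTime", "LastHIVTestResult", "Notes"),
  DATA_TYPE   = c("int", "datetime", "varchar", "varchar"),
  stringsAsFactors = FALSE
)

# Mock runner so the sketch runs offline: it fails whenever one specific
# column appears in the statement.
mock_run_query <- function(sql) {
  if (grepl("[LastHIVTestResult]", sql, fixed = TRUE)) {
    stop("simulated driver error")
  }
  data.frame(x = 1)
}

failing_types <- find_failing_types(toy_schema, mock_run_query)
print(failing_types)
```

Against the real table, `run_query` would be replaced by a wrapper around `dbGetQuery(con, sql)` and `toy_schema` by `visits_schema_df`.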

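Another solution is to stitch R data frames together by type (adjusting the schema query accordingly) using a chain merge on the primary key (assumed here to be DW_Id):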

final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
                     list(data_int_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
                          data_num_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT 
                          data_dt_columns,         # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
                          data_char_columns)       # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
              )
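
The separate per-type queries feeding that merge could be generated from the schema data frame. This is a rough sketch under assumptions: `build_type_queries` is a hypothetical helper, `toy_schema` is a made-up stand-in for `visits_schema_df`, and `DW_Id` is taken to be the primary key as assumed above.

```r
# Hypothetical helper: build one SELECT per data type, each including the key
# column so the results can later be merged on it.
build_type_queries <- function(schema_df, key = "DW_Id",
                               from = "[eCW].[Visits]") {
  non_key <- schema_df[schema_df$COLUMN_NAME != key, ]
  columns_by_type <- split(non_key$COLUMN_NAME, non_key$DATA_TYPE)
  vapply(columns_by_type, function(cols) {
    select_list <- paste0("[", c(key, cols), "]", collapse = ", ")
    paste0("SELECT ", select_list, " FROM ", from)
  }, character(1))
}

# Made-up schema standing in for visits_schema_df.
toy_schema <- data.frame(
  COLUMN_NAME = c("DW_Id", "VisitDateTime", "LastHIVTestResult"),
  DATA_TYPE   = c("int", "datetime", "varchar"),
  stringsAsFactors = FALSE
)

type_queries <- build_type_queries(toy_schema)
print(type_queries)

# With a live connection (not run here), each statement would be fetched and
# the pieces merged on the key:
# pieces <- lapply(type_queries, function(sql) dbGetQuery(con, sql))
# final_data <- Reduce(function(x, y) merge(x, y, by = "DW_Id"), pieces)
```

Note that `merge` performs an inner join by default, so a row is kept only if its key appears in every piece; with a true primary key and unfiltered queries, every piece should contain the same keys.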