Import multiple csv files into a PostgreSQL database using R (memory error)

Question

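I am trying to import a dataset (consisting of many csv files) into R and then write the data to a table in a PostgreSQL database.

I connected to the database successfully, wrote a loop to import the csv files, and tried to run the import. R then returns an error because my computer runs out of memory.

My question is: is there a way to create a loop that imports the files one after another, writes each one to the PostgreSQL table, and then deletes it from memory? That way I would not run out of memory.

The code that returns the memory error: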

`#connect to PostgreSQL database
db_tankdata <- 'tankdaten'  
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'postgres'  
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host=host_db, 
                 port=db_port, user=db_user, password=db_password)

#check if connection was successful
dbExistsTable(con, "prices")

#create function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv",recursive = TRUE, full.names = TRUE)
  lapply(files,read_csv) %>% bind_rows() %>% as.data.frame()
    }


#import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices , append = TRUE, row.names = FALSE)`

Thanks in advance for your feedback!

Answer 1

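If you change lapply() to use an anonymous function, you can read each file and write it to the database straight away, which reduces the amount of memory required. Because lapply() acts as an implicit for() loop, you do not need an additional looping mechanism.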

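import_csvfiles <- function(path){
     # "pattern" is a regular expression, not a glob, so match the ".csv" suffix explicitly
     files <- list.files(path, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
     # Read one file, append it to the table, then let the data frame go out of scope
     invisible(lapply(files, function(x){
          prices <- read.csv(x)
          dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
          }))
}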
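The question also asks about deleting each file's data after it has been written. A minimal sketch of that idea follows, using an explicit for() loop with rm() and gc(). The helper name import_one_by_one and the write_fn callback are hypothetical and only used for illustration; in the real script the callback would presumably wrap the dbWriteTable() call shown above. The demo part uses temporary files and a counter instead of a database so that it runs on its own.

```r
# Hypothetical helper: read, write and discard one CSV file at a time.
# write_fn is an assumed callback, for example:
#   function(df) dbWriteTable(con, "prices", df, append = TRUE, row.names = FALSE)
import_one_by_one <- function(path, write_fn) {
  files <- list.files(path, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
  total <- 0
  for (f in files) {
    df <- read.csv(f)          # only this file is held in memory
    write_fn(df)               # hand the data to the writer (e.g. the database)
    total <- total + nrow(df)
    rm(df)                     # drop the reference to the data frame
    gc()                       # optionally ask R to free the memory right away
  }
  invisible(total)             # number of rows processed
}

# Demo with two small temporary files and a counting callback (no database needed)
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)
write.csv(data.frame(id = 1:3, price = c(1.5, 1.6, 1.7)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5, price = c(1.8, 1.9)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

counter <- new.env()
counter$rows <- 0
rows_total <- import_one_by_one(demo_dir, function(df) {
  counter$rows <- counter$rows + nrow(df)
})
```

The explicit rm() and gc() calls are usually not strictly necessary, since the data frame is overwritten on the next iteration anyway, but they make the "delete after writing" step visible.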

Answer 2

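I assume the csv files you want to import into the database are very large? As far as I know, with the code you have written, R first stores the data in a data frame, and data frames are held in memory. An alternative is to read the CSV file in chunks, as you would with Python's Pandas.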

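Calling ?read.csv shows the following:

nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.

skip : the number of lines of the data file to skip before beginning to read data.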
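To make these two arguments concrete, here is a small sketch on a temporary 10-row file (the file and its columns are made up for the illustration). Note that skip counts lines of the file, so the header line has to be included in the count, and any chunk after the first has no header of its own.

```r
# Made-up example file with a header line and 10 data rows
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), tmp, row.names = FALSE)

# First chunk: the header line is consumed as column names, then 4 rows are read
first_chunk <- read.csv(tmp, nrows = 4)
col_names <- names(first_chunk)

# Second chunk: skip the header line plus the 4 rows already read (1 + 4 = 5 lines).
# The first line read is now data, so header = FALSE and the names are passed in.
second_chunk <- read.csv(tmp, header = FALSE, col.names = col_names,
                         nrows = 4, skip = 5)
```

Here first_chunk holds the rows with id 1 to 4 and second_chunk the rows with id 5 to 8.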

Why not read 5000 rows at a time into a data frame, write them to the PostgreSQL database, and then do this for each file?

For example, do the following for each file:

number_of_lines <- 5000                # Number of rows to read at a time
row_skip <- 1                          # Lines to skip: starts at 1 to step over the header line
keep_reading <- TRUE                   # We will change this value to stop the while

# Read the column names once; later chunks have no header line of their own
col_names <- names(read.csv(x, nrows = 1))

while (keep_reading) {
    # read.csv() raises an error when no lines are left (row count is an exact multiple of the chunk size)
    my_data <- tryCatch(
        read.csv(x, header = FALSE, col.names = col_names,
                 nrows = number_of_lines, skip = row_skip),
        error = function(e) NULL)
    if (is.null(my_data)) break

    dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE) # Write to the DB

    row_skip <- row_skip + number_of_lines   # Advance past the rows just read

    # Exit condition: a chunk shorter than number_of_lines means the end of the file was reached
    if (nrow(my_data) < number_of_lines) {
        keep_reading <- FALSE
    } # end-if
} # end-while

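What this does is break the csv up into smaller parts. You can adjust the number_of_lines variable to reduce the number of loop iterations. It may look a bit hacky because it involves a loop, but I believe it will work.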
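One cost of the skip-based loop is that every read.csv() call starts again from the top of the file and has to pass over all the lines it skips. A possible variation, sketched below, keeps a single file connection open and pulls the next block of lines with readLines(), so each line is read only once. The function name read_csv_in_chunks and the write_fn callback are hypothetical; the sketch also assumes a comma-separated file with one record per line (no quoted fields containing line breaks). In practice write_fn would presumably wrap the dbWriteTable() call.

```r
# Hypothetical helper: stream a CSV file in chunks over one open connection.
# write_fn is an assumed callback, for example:
#   function(df) dbWriteTable(con, "prices", df, append = TRUE, row.names = FALSE)
read_csv_in_chunks <- function(path, chunk_size, write_fn) {
  input <- file(path, open = "r")
  on.exit(close(input))                      # always close the file connection

  header_line <- readLines(input, n = 1)
  if (length(header_line) == 0) return(invisible(0))   # empty file
  # Split the header line into column names (quotes are stripped by scan)
  col_names <- scan(text = header_line, what = "", sep = ",", quiet = TRUE)

  total <- 0
  repeat {
    lines <- readLines(input, n = chunk_size)  # continues where the last read stopped
    if (length(lines) == 0) break              # end of file
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    write_fn(chunk)
    total <- total + nrow(chunk)
  }
  invisible(total)                             # number of rows processed
}

# Demo on a made-up 12-row file, recording the chunk sizes instead of writing to a database
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:12, price = seq(1.0, 2.1, by = 0.1)), tmp, row.names = FALSE)

chunk_log <- new.env()
chunk_log$sizes <- integer(0)
total_rows <- read_csv_in_chunks(tmp, chunk_size = 5, write_fn = function(df) {
  chunk_log$sizes <- c(chunk_log$sizes, nrow(df))
})
```

As with any chunked read, column types are guessed per chunk, so a column that is entirely empty in one chunk may come back with a different type than in the others; passing colClasses to read.csv() is one way to pin the types down.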


csv

r

rpostgresql