Partition a large file into small files in R
I need to split a large file (14 GB) into several smaller files. The file is a text file, the separator is ";", and I know it has 70 columns (strings and doubles). I want to read 1 million rows at a time and save each batch to a different file: file1, file2 ... fileN.
With the help of @MKR I got the code below, but the process is very slow; I tried using fread, but it did not work.
How can this code be optimized?
New code:
chunkSize <- 500000

# open a read connection to the input file (db holds the file path)
conex <- file(description = db, open = "r")

# the first chunk is read with the header row so the column names are known
dataChunk <- read.table(conex, nrows = chunkSize, header = TRUE, fill = TRUE, sep = ";")
db_colnames <- colnames(dataChunk)

index <- 0
counter <- 0
total <- 0

repeat {
  dataChunk <- read.table(conex, nrows = chunkSize, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)
  total <- total + sum(dataChunk$total)
  counter <- counter + nrow(dataChunk)
  write.table(dataChunk, file = paste0("MY_FILE_new", index), sep = ";", row.names = FALSE)
  # a short chunk means the end of the file was reached
  if (nrow(dataChunk) != chunkSize) {
    print('lines ok')
    break
  }
  index <- index + 1
  print(paste('lines', index * chunkSize))
}
You are completely on the right track towards a solution.
The approach should be:
1. Read 1 million lines
2. Write them to a new file
3. Read the next 1 million lines
4. Write them to another new file
Let's turn the above logic into a loop, along the lines of what the OP attempted:
# open a read connection to the input file (db holds the file path, as in the question)
con <- file(description = db, open = "r")

index <- 0
counter <- 0
total <- 0
chunks <- 500000

repeat {
  dataChunk <- read.table(con, nrows = chunks, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)

  # do any processing on dataChunk here (e.g. adding a header, converting data types)

  # create a new file name and write the chunk to it; use your own logic for file names
  write.table(dataChunk, file = paste0("file", index))

  # check whether the end of the file has been reached and break out of repeat
  if (nrow(dataChunk) < chunks) {
    break
  }

  # increment the index to read the next chunk
  index <- index + 1
}
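Because the connection con stays open across iterations, every read.table call continues from the line where the previous one stopped, so the file is scanned only once; that is also why the loop reads with header = FALSE and supplies the column names explicitly through col.names.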
Edited: Modified as per the OP's request, adding another option that reads the file with data.table::fread.
library(data.table)

index <- 0
counter <- 0
total <- 0
chunks <- 1000000
fileName <- "myfile"

repeat {
  # with fread the file is re-opened in each iteration; skip jumps past the rows already read
  dataChunk <- fread(input = fileName, nrows = chunks, header = FALSE, fill = TRUE,
                     skip = chunks * index, sep = ";", col.names = db_colnames)

  # do any processing on dataChunk here (e.g. adding a header, converting data types)

  # create a new file name and write the chunk to it; use your own logic for file names
  write.table(dataChunk, file = paste0("file", index))

  # check whether the end of the file has been reached and break out of repeat
  if (nrow(dataChunk) < chunks) {
    break
  }

  # increment the index to read the next chunk
  index <- index + 1
}
Note: The code above is only a partial pseudo-code snippet meant to help the OP. It will not run and produce results on its own.
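For reference, a minimal self-contained sketch of the fread variant that should run as-is; the input path "myfile.txt" and the ";" separator are assumptions, and data.table::fwrite is used instead of write.table because it is usually much faster for large chunks:

library(data.table)

inputFile <- "myfile.txt"   # assumed path; replace with the real 14 GB file
chunkSize <- 1000000        # 1 million rows per output file

# read only the header line to capture the column names
db_colnames <- names(fread(inputFile, sep = ";", nrows = 0))

index <- 0
repeat {
  # skip = the header line plus all rows consumed by earlier iterations
  dataChunk <- tryCatch(
    fread(inputFile, sep = ";", header = FALSE, fill = TRUE,
          skip = index * chunkSize + 1, nrows = chunkSize,
          col.names = db_colnames),
    error = function(e) data.table()   # skip past end of file -> empty chunk
  )
  if (nrow(dataChunk) == 0) break

  # fwrite is data.table's fast writer
  fwrite(dataChunk, file = paste0("file", index, ".txt"), sep = ";")

  if (nrow(dataChunk) < chunkSize) break
  index <- index + 1
}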
Not an R-based answer, but in this case I recommend a shell-based solution using GNU split. This should be significantly faster than an R solution.
To split the file into chunks of 10^6 lines each, you would do:
split -l 1000000 my_file.txt
For details on split, see e.g. here.
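If the resulting pieces then need to be processed from R, here is a minimal read-back sketch, assuming split's default output names xaa, xab, ... and keeping in mind that the original header line ends up only in the first piece:

library(data.table)

# split names its output files xaa, xab, ... by default
pieces <- sort(list.files(pattern = "^x[a-z]{2}$"))

# the header line of the original file is only present in the first piece
db_colnames <- names(fread(pieces[1], sep = ";", nrows = 0))

for (p in pieces) {
  chunk <- fread(p, sep = ";", header = FALSE,
                 skip = if (p == pieces[1]) 1L else 0L,
                 col.names = db_colnames)
  # ... process or re-save each chunk here ...
}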