Read a directory of txt files line by line into an R dataframe with filenames as one column
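I have a directory of text files. I want to read the contents of these text files line by line into an R data frame. The text files contain unstructured text. The desired data frame output is: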
file; line
1.txt; "line 1 in 1.txt"
1.txt; "line 2 in 1.txt"
2.txt; "line 1 in 2.txt"
...
I wrote the code below, but it results in an error. I also suspect there is a more direct way to do this, for example with readr and dplyr.
files <- list.files(path="./data", pattern = "*.txt", full.names = TRUE) # read data folder txt files
my_lines <-list() # create temp list for reading lines
df <- data_frame( "file" = character(0), "line" = character(0))
for (file in files){
  my_lines <- readLines(file) # read lines from file into a list
  for (line in my_lines){
    df$file<-file
    df$fline<-line
  }
}
A simple (but inefficient) solution is:
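files <- list.files(path = "./data", pattern = "\\.txt$", full.names = TRUE)  # pattern is a regex, not a glob
fls <- NULL
lns <- NULL
for (file in files) {
  my_lines <- readLines(file)  # character vector, one element per line
  for (line in my_lines) {
    fls <- c(fls, basename(file))  # keep only the file name, as in the output below
    lns <- c(lns, line)
  }
}
df <- data.frame(file = fls, fline = lns)
print(df)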
file fline
1 1.txt line1_in_1.txt
2 1.txt line2_in_1.txt
3 2.txt line1_in_2.txt
4 2.txt line2_in_2.txt
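The loop above is slow on large inputs because `c()` copies the growing vectors on every iteration. A possible sketch that avoids this builds one data frame per file and binds them once at the end. The helper name `read_dir_lines`, the temporary directory, and the file contents are made up for illustration.

```r
# Hypothetical helper: read every .txt file in `dir` into one data frame.
read_dir_lines <- function(dir) {
  files <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
  # One data frame per file; readLines() returns a character vector of lines.
  per_file <- lapply(files, function(f) {
    lines <- readLines(f)
    data.frame(file = rep(basename(f), length(lines)),
               line = lines,
               stringsAsFactors = FALSE)
  })
  # Bind all pieces in a single step instead of growing vectors in a loop.
  do.call(rbind, per_file)
}

# Demo on a temporary directory with two small files (made-up content).
demo_dir <- tempfile("txt_demo_")
dir.create(demo_dir)
writeLines(c("line 1 in 1.txt", "line 2 in 1.txt"), file.path(demo_dir, "1.txt"))
writeLines("line 1 in 2.txt", file.path(demo_dir, "2.txt"))
df <- read_dir_lines(demo_dir)
print(df)
```

Note that this sketch returns `NULL` when the directory contains no matching files, so a caller may want to check for that case.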
An alternative solution without loops:
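file <- list.files(path = "C:/...", pattern = "\\.txt$", full.names = TRUE)  # full paths of the txt files
line <- lapply(file, readLines)  # one character vector of lines per file
# repeat each file name once per line read from that file
file <- unlist(mapply(rep, file, sapply(line, length), SIMPLIFY = FALSE, USE.NAMES = FALSE))
df <- data.frame(file = file, line = unlist(line))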
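The step that repeats each file name once per line can also be written with `rep()` and `lengths()`, which should give the same vector as the `mapply()` call. The sketch below uses made-up in-memory values in place of the results of `list.files()` and `readLines()`, so it runs without any files on disk.

```r
# Made-up stand-ins for the results of list.files() and lapply(file, readLines).
file <- c("1.txt", "2.txt")
line <- list(c("line 1 in 1.txt", "line 2 in 1.txt"), "line 1 in 2.txt")

# lengths() gives the number of lines per file; rep() repeats each name that often.
file_col <- rep(file, times = lengths(line))

df <- data.frame(file = file_col, line = unlist(line), stringsAsFactors = FALSE)
print(df)
```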
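Setting full.names to TRUE makes the file names very long... If you set the working directory beforehand, the path and full.names arguments to list.files() are not needed, and your data frame will contain only the actual file names, without paths.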
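If changing the working directory is not desirable, another option is to keep the full paths for reading and strip them with `basename()` only when building the data frame. This is a minimal sketch under that assumption; the temporary directory and file contents are invented for the demo.

```r
# Demo directory with one made-up file.
demo_dir <- tempfile("txt_demo_")
dir.create(demo_dir)
writeLines(c("first line", "second line"), file.path(demo_dir, "1.txt"))

# Full paths are needed to open the files from outside the directory...
paths <- list.files(path = demo_dir, pattern = "\\.txt$", full.names = TRUE)
lines <- lapply(paths, readLines)

# ...but basename() gives the short names for the data frame column.
df <- data.frame(file = rep(basename(paths), times = lengths(lines)),
                 line = unlist(lines),
                 stringsAsFactors = FALSE)
print(df)
```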