Create a data.frame from multiple text files with row names as columns in R

Question

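I am new to R and I am already stuck at reading in my files.

I have a list of 1100 .txt files. The first four lines of each file are metadata ("Newspaper", "Date", "Ressort", "Title"); the body text starts at the fifth line.

The problem: I cannot get the data.frame finished. Below I show the steps for my first .txt file.

So, this is what I tried:

I listed the files in R with list.files() and wrote a for loop: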

datalist <- list.files()

for(i in datalist){
  test <- readLines(i, encoding = 'UTF-8')
}

The first file is the test file:

test <- readLines(i, encoding = 'UTF-8')

The test file gives me the metadata:

meta <- test[1:4]

Then I define everything from line 5 onward as the text and remove the line breaks:

text <- paste(test[5:length(test)], collapse = '')

Then I create my data.frame with the metadata as columns plus the text:

df <- data.frame(datalist, Newspaper = meta[1], Date = meta[2], Resssort = meta[3], Text = text)
df

Write it out as a CSV, of course:

write.csv(df, "test.csv")

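The problem now is that my columns are set up fine, but every row contains the same data, namely the data from test in the for loop. Any ideas? I would be very happy and grateful for any tips or answers! Cheers, everyone.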

Answer 1

I am not sure I follow, but do you first run this loop:

for(i in datalist){
  test <- readLines(i, encoding = 'UTF-8')
}

followed by this line:

test <- readLines(i, encoding = 'UTF-8')

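Because if so, you are overwriting test on every iteration, so after the loop test holds only the contents of the file that i last pointed to, which is the last file in datalist. Everything you do afterwards works on that single file.

If that is not it, I am not sure what goes wrong, but if you suspect the for loop, you could also do this: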
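One way to keep every file instead of only the last one is to store each iteration's result in a list and bind the pieces afterwards. The following is a minimal sketch, assuming every file has the four metadata lines followed by the body. The temporary folder, the two sample files, and the names demo_dir, rows, and df_all are made up for illustration and do not come from the question.

```r
# Hypothetical sample data so the sketch runs on its own
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Nice News", "2021-01-01", "ressort", "Title A", "Body A1", "Body A2"),
           file.path(demo_dir, "a.txt"))
writeLines(c("Old News", "1990-01-01", "ressort", "Title B", "Body B1"),
           file.path(demo_dir, "b.txt"))

datalist <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Pre-allocate one slot per file so nothing gets overwritten
rows <- vector("list", length(datalist))

for (k in seq_along(datalist)) {
  test <- readLines(datalist[k], encoding = "UTF-8")
  rows[[k]] <- data.frame(
    File      = basename(datalist[k]),
    Newspaper = test[1],
    Date      = test[2],
    Ressort   = test[3],
    Title     = test[4],
    # Assumption: join body lines with a space; the question used collapse = ''
    Text      = paste(test[-(1:4)], collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data frames into a single data frame
df_all <- do.call(rbind, rows)
```

Each row of df_all then belongs to a different file, and the File column makes it easy to check which one.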

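# full.names = TRUE returns complete paths, so no manual paste() is needed
temp <- list.files(path = '/completepathof/mapyourfilesarein', full.names = TRUE)

# read every file; the result is a list with one character vector per file
myfiles <- lapply(temp, readLines, encoding = 'UTF-8')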

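This should work (I use it with read.csv, but I do not see why it would not work with readLines).

Also, if you want to find out where things go wrong, run str(test) or str(df) after each line (that is, str() on whatever object you are interested in) to check whether you accidentally changed something and whether the columns and data were read and converted the right way. Hope this helps.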
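Note that lapply() returns a list with one character vector per file, so one more step is needed to get one row per file. Here is a hedged sketch of that step. The helper parse_one and the in-memory sample list are hypothetical stand-ins for the result of the lapply() call above.

```r
# Stand-in for the list returned by lapply(temp, readLines, ...):
# one character vector per file (made-up content)
myfiles <- list(
  c("Nice News", "2021-01-01", "ressort", "Title A", "Body A1", "Body A2"),
  c("Old News", "1990-01-01", "ressort", "Title B", "Body B1")
)

# Hypothetical helper: turn the lines of one file into a one-row data frame
parse_one <- function(lines) {
  data.frame(
    Newspaper = lines[1],
    Date      = lines[2],
    Ressort   = lines[3],
    Title     = lines[4],
    Text      = paste(lines[-(1:4)], collapse = ""),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every file, then stack the rows
df <- do.call(rbind, lapply(myfiles, parse_one))
```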
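As a small illustration of that check, the sketch below calls str() on a character vector and on a data frame. The sample lines are made up and stand in for one file read with readLines().

```r
# Made-up lines standing in for one file read with readLines()
test <- c("Nice News", "2021-01-01", "ressort", "Title A", "Body A1")
str(test)  # a character vector with one element per line

df <- data.frame(Newspaper = test[1], Date = test[2], stringsAsFactors = FALSE)
str(df)    # a data frame: number of rows and columns, plus each column's type
```

If str(df) reports more rows than expected, or the same values in every row, the step just before it is the place to look.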

Answer 2

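A possible solution uses map_dfr from {purrr} to map the list of file names onto a (custom) function that reads the data. The main advantage of this approach is that you do not have to create a list of data frames and merge them afterwards, and by avoiding a loop you do not create temporary objects that clutter your working environment. All objects created inside the function exist only inside the function.

The downside is that it can be hard at first to understand what happens behind the scenes, whereas in a for loop every step is more explicit. If you have the time, I encourage you to watch Hadley Wickham's talk The Joy of Functional Programming (for Data Science). About 8 minutes in, he talks about exactly the kind of problem you are facing. But the whole video is worth the time! :)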

library(tidyverse)
datalist <- list.files("data/newspaper", full.names = TRUE)

custom_read_lines <- function(file) {
  # define function to read files and return a data frame already as expected
  # in final output
  whole_file <- readLines(file)
  text <- paste(whole_file[5:length(whole_file)], collapse = '')

  df <- data.frame(
    Newspaper  = whole_file[1],
    Date       = whole_file[2],
    Ressort    = whole_file[3],
    Title      = whole_file[4],
    Text       = text
  )

  return(df)
}

## using purrr's map_dfr to map each entry of data list to the custom function
df_merged <- datalist %>% map_dfr(custom_read_lines)

df_merged %>% as_tibble() #just for nicer output
# A tibble: 3 x 5
# Newspaper   Date      Ressort Title             Text
# <chr>       <chr>     <chr>   <chr>             <chr>
# 1 Nice News   2021-01-… ressort Where does it co… "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of …
# 2 Boring News 2020-11-… ressort Why do we use it? "It is a long established fact that a reader will be distracted by the readable content of a pa…
# 3 Old News    1990-01-… ressort What is Lorem Ip… "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has bee…



# in case your data looks a bit different and you want to try out first you can always try
# the function for a single file, if it works you can then pass it to the map_dfr function

single_file <- datalist[2]  # any one file will do; here the second one
custom_read_lines(single_file) %>% as_tibble()
# A tibble: 1 x 5
# Newspaper   Date      Ressort Title         Text
# <chr>       <chr>     <chr>   <chr>         <chr>
#   1 Boring News 2020-11-… ressort Why do we us… It is a long established fact that a reader will be distracted by the readable content of a page wh…

Each of the 3 sample files looks more or less like this:

Nice News
2021-01-01
ressort
Where does it come from?
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
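One edge case worth checking with 1100 files: if a file contains only the four metadata lines, the index 5:length(whole_file) becomes 5:4, which counts down instead of being empty. The sketch below uses made-up content to compare that indexing with negative indexing, which is one possible alternative. The names text_range and text_drop are hypothetical.

```r
# Made-up file content with metadata only and no body
whole_file <- c("Nice News", "2021-01-01", "ressort", "Where does it come from?")

# 5:length(whole_file) is 5:4 here, which counts down and picks lines 5 and 4
text_range <- paste(whole_file[5:length(whole_file)], collapse = "")

# Negative indexing drops the first four lines and leaves an empty vector
text_drop <- paste(whole_file[-(1:4)], collapse = "")

print(text_range)
print(text_drop)
```

With negative indexing the Text column is an empty string for such a file, rather than an NA pasted together with the title.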
