How to Create a Tibble from Many .txt Files, Preserve File Names in a Column, and Use File Names to Sort Files into Categories?

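I have 584 .txt files that I want to merge into a single 584 x 4 tibble.

Important background information:

The files fall into three categories, according to a label embedded in each file name. Thus:

A_1_COD.txt, A_23_COD.txt, A_235_COD, ..., A_457_COD -> belong to category A;

B_3_COD.txt, B_19_COD.txt, B_189_COD, ..., B_355_COD -> belong to category B;

C_5_COD.txt, C_11_COD.txt, C_196_COD, ..., C_513_COD -> belong to category C;

To make this easier to follow, the file names shown in this section have been modified. Examples of real file names are: ENTITY_117_MOR.txt; INCREMENTAL_208_MOR.txt; MODERATE_173_MOR.txt. The real categories are: ENTITY, INCREMENTAL, and MODERATE.

The resulting tibble should be structured like this: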

A tibble: 584 x 4

row  filename   category  text
     <?>        <fct>     <chr>
1    A_1_COD    A         "Lorem ipsu-
2    B_2_COD    B         "Lorem ipsu-
3    C_3_COD    C         "Lorem ipsu-
.    .          .         .
.    .          .         .
.    .          .         .
584  A_584_COD  A         "Lorem ipsu-

What I have done so far: Thanks to @awaji98, I managed to get three of the four columns I need by using the following code:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <-
  folder %>%
  # get full path names for each text (pattern is a regex, not a glob)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")

# if you prefer a tibble output
dat %>% as_tibble()

The result is shown in the image below:

The picture shows the resulting table with all the data except for category

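Remaining problem: I need R to extract the category embedded in each file name (i.e., ENTITY, INCREMENTAL, MODERATE) and fill the category column with the corresponding value.

@awaji98 suggested two possible approaches. This is the first: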

> dat <- folder %>% 
+     # get full path names for each text
+     dir(pattern = "*.txt", 
+         full.names = T) %>% 
+     # map readtext function to each path name into a dataframe
+     map_df(., readtext) %>% 
+     # add and change columns as desired
+     mutate(filename= str_remove(doc_id, ".txt$")) %>% 
+     tidyr::extract(filename, into = "category", regex = "^([A-Z]+)_", remove = FALSE) %>% 
+     mutate(category = factor(category)) %>% 
+     select(filename,category,text) %>% 
+     rowid_to_column(var = "row") %>% 
+     tibble()

This resulted in a category column filled with red "NA" values.

This is the second:

> dat <- ## Use tidyr::extract to create two new columns from doc_id
+     folder %>% 
+     # get full path names for each text
+     dir(pattern = "*.txt", 
+         full.names = T) %>% 
+     # map readtext function to each path name into a dataframe
+     map_df(., readtext) %>% 
+     # add and change columns as desired
+     mutate(filename= str_remove(doc_id, ".txt$")) %>% 
+     tidyr::extract(doc_id, into = c("category","filename"), regex = "^([A-Z]+)_(.*).txt$") %>% 
+     mutate(category = factor(category)) %>% 
+     select(filename,category,text) %>% 
+     rowid_to_column(var = "row") %>% 
+     tibble()

As the image below shows, this produced two columns filled with red "NA" values.
image shows tibble with two columns containing red "NAs," which was not the expected output.
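
When an anchored pattern returns NA for every row, it can help to inspect the raw doc_id strings for invisible characters before rewriting the regex. A minimal sketch, using invented doc_id values (the leading space in them is an assumption added for illustration) and only stringr functions:

```r
library(stringr)

# Hypothetical doc_id values; the leading space is deliberate
doc_id <- c(" ENTITY_117_MOR.txt",
            " INCREMENTAL_208_MOR.txt",
            " MODERATE_173_MOR.txt")

# The anchored pattern finds nothing because the first character is a space
raw_match <- str_extract(doc_id, "^[A-Z]+")

# Flag values that carry leading or trailing whitespace
has_padding <- doc_id != str_trim(doc_id)

# After trimming, the same pattern matches the category label
trimmed_match <- str_extract(str_trim(doc_id), "^[A-Z]+")
```

Printing doc_id at the console, where strings appear inside quotes, is another quick way to spot padding that a rendered table hides.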

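Final solution

@awaji98 realized the problem was in the regular expression. It turned out that the file names begin with a whitespace character, so a pattern anchored with ^ directly before the letters never matched. The fix was to add a space at the front of each regex in the answer, right after the ^ anchor. The code that delivers the expected result is therefore: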

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <- folder %>%
  # get full path names for each text (pattern is a regex, not a glob)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # note the space after ^: the file names start with a space
  tidyr::extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  as_tibble()

The final result is shown in the image below:

Kind regards,
Á_C

You can combine a few common tidyverse functions with the useful readtext() function from the package of the same name:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <-
  folder %>%
  # get full path names for each text (pattern is a regex, not a glob)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")

# if you prefer a tibble output
dat %>% as_tibble()
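
To try the same pipeline shape without the real corpus, something like the following may be useful. It is only a sketch: the folder, the three file names, and their contents are invented, and base readLines() stands in for readtext() so that the example has no dependency beyond the tidyverse packages.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Temporary folder holding three invented files
folder <- file.path(tempdir(), "texts_demo")
dir.create(folder, showWarnings = FALSE)
demo_names <- c("A_1_COD.txt", "B_2_COD.txt", "C_3_COD.txt")
walk(demo_names, ~ writeLines("Lorem ipsum", file.path(folder, .x)))

# Full paths of every .txt file in the folder
paths <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# One row per file: the file name plus its whole content as a single string
dat <- map_dfr(paths, function(p) {
  tibble(
    doc_id = basename(p),
    text   = paste(readLines(p, warn = FALSE), collapse = "\n")
  )
}) %>%
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = factor(str_extract(doc_id, "^[A-Z]+"))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")
```

Here the category is taken as the run of capital letters at the start of the name, which also covers multi-letter labels such as ENTITY.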

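Update:

Perhaps one of the following will meet your needs. The first example keeps the filename column intact, with the category still at the front of each value: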

folder %>%
  # get full path names for each text (pattern is a regex, not a glob)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # the space after ^ matches the space at the start of the file names
  tidyr::extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  as_tibble()

The second uses tidyr::extract to create two columns from doc_id, so the filename no longer includes the category part:

## Use tidyr::extract to create two new columns from doc_id
folder %>%
  # get full path names for each text (pattern is a regex, not a glob)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(readtext) %>%
  # split doc_id into category and filename; the space after ^ matches
  # the space at the start of the file names
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^ ([A-Z]+)_(.*)\\.txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  as_tibble()
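
If it is unclear whether the file names carry stray whitespace, trimming first avoids hard-coding the space in the pattern. A minimal sketch, assuming doc_id values shaped like the real file names in the question; the small tibble below is invented and merely stands in for the output of readtext():

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented stand-in for the data frame returned by readtext()
docs <- tibble(
  doc_id = c(" ENTITY_117_MOR.txt", "INCREMENTAL_208_MOR.txt", " MODERATE_173_MOR.txt"),
  text   = c("Lorem ipsum", "Lorem ipsum", "Lorem ipsum")
)

dat <- docs %>%
  mutate(
    # trim surrounding whitespace, then drop the extension
    filename = str_remove(str_trim(doc_id), "\\.txt$"),
    # the capital letters before the first underscore are the category
    category = factor(str_extract(filename, "^[A-Z]+(?=_)"))
  ) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")
```

This gives the same category values whether or not a given name starts with a space, at the cost of also removing the space from the filename column.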