How to Create a Tibble from Many .txt Files, Preserve File Names in a Column, and Use File Names to Sort Files into Categories?
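I have 584 .txt files that I would like to combine into a single 584 x 4 tibble.
Important background information:
The files fall into three categories according to a tag embedded in the file name. Thus:
A_1_COD.txt, A_23_COD.txt, A_235_COD.txt, ..., A_457_COD.txt -> belong to category A;
B_3_COD.txt, B_19_COD.txt, B_189_COD.txt, ..., B_355_COD.txt -> belong to category B;
C_5_COD.txt, C_11_COD.txt, C_196_COD.txt, ..., C_513_COD.txt -> belong to category C;
The file names shown here have been modified to make them easier to follow. Examples of real file names are: ENTITY_117_MOR.txt; INCREMENTAL_208_MOR.txt; MODERATE_173_MOR.txt. The real categories are: ENTITY, INCREMENTAL, and MODERATE.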
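As a minimal sketch of the naming scheme described above, the category can be read as everything before the first underscore. The vector `files` below is a hypothetical stand-in built from the three real examples, and the snippet uses base R only, so it is an illustration rather than the code used later in this thread:

```r
# Hypothetical file names following the CATEGORY_NUMBER_SUFFIX.txt scheme
files <- c("ENTITY_117_MOR.txt", "INCREMENTAL_208_MOR.txt", "MODERATE_173_MOR.txt")

# Drop the extension; the dot is escaped because the pattern is a regex
filename <- sub("\\.txt$", "", files)

# Keep only the part before the first underscore and store it as a factor
category <- factor(sub("_.*$", "", filename))

data.frame(filename, category)
```

This assumes clean names with no stray whitespace around them.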
The resulting tibble should be structured like this:
A tibble: 584 x 4
row | filename <?> | category <fct> | text <chr> |
---|---|---|---|
1 | A_1_COD | A | "Lorem ipsu- |
2 | B_2_COD | B | "Lorem ipsu- |
3 | C_3_COD | C | "Lorem ipsu- |
. | . | . | . |
. | . | . | . |
. | . | . | . |
584 | A_584_COD | A | "Lorem ipsu- |
What I have done so far:
Thanks to @awaji98, I managed to get three of the four columns I intend to have by using the following code:
library(tidyverse)
library(readtext)
folder <- "path_to_folder_of_texts"
dat <-
  folder %>%
  # get full path names for each text (pattern is a regex, so the dot is escaped)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")
# if you prefer a tibble output
dat %>% tibble()
The result looks like the picture below:
The picture shows the resulting table with all the data except for category
Problem to be solved: I need R to extract the categories embedded in the file names (i.e. ENTITY, INCREMENTAL, MODERATE) and fill the category column with the respective values.
@awaji98 suggested two possible paths. This is the first one:
dat <- folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  tidyr::extract(filename, into = "category", regex = "^([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
This resulted in a column filled with red "NA" values.
The second one:
## Use tidyr::extract to create two new columns from doc_id
dat <-
  folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^([A-Z]+)_(.*)\\.txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
As the picture below shows, this produced two columns filled with red "NA" values.
image shows tibble with two columns containing red "NAs," which was not the expected output.
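When an extraction pattern returns nothing but NA, one way to investigate is to look at the raw strings for characters that do not show up in a printed tibble. The sketch below is only an illustration: the `doc_id` vector is hypothetical (two names are given a leading space on purpose), and it uses base R rather than the tidyverse calls from this thread.

```r
# Hypothetical doc_id values; the first two start with a space
doc_id <- c(" ENTITY_117_MOR.txt", " MODERATE_173_MOR.txt", "INCREMENTAL_208_MOR.txt")

# dput() prints each string inside quotes, which makes stray spaces visible
dput(doc_id)

# Flag names that begin or end with whitespace
has_stray_space <- grepl("^[[:space:]]|[[:space:]]$", doc_id)
has_stray_space

# A pattern anchored at "^" followed by capital letters cannot match
# a name whose first character is a space
matches_pattern <- grepl("^[A-Z]+_", doc_id)
matches_pattern
```

Names flagged by `has_stray_space` are exactly the ones that `matches_pattern` rejects in this toy input, which is the kind of mismatch that would leave a category column full of NA.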
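Final solution
@awaji98 realized that the problem was in the regex. It turned out that the file names start with a white space (the regexes are anchored with "^", so the extra character sits at the beginning of the name). The solution was to add a space at the front of each regex in the answer. The code that delivered the expected result is therefore: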
library(tidyverse)
library(readtext)
folder <- "path_to_folder_of_texts"
dat <- folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # note the literal space after "^": the names start with a space
  tidyr::extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
The final result looks like the picture below:
Kind regards,
Á_C
You can combine some common tidyverse functions with the useful readtext() from the package of the same name:
library(tidyverse)
library(readtext)
folder <- "path_to_folder_of_texts"
dat <-
  folder %>%
  # get full path names for each text (pattern is a regex, so the dot is escaped)
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row")
# if you prefer a tibble output
dat %>% tibble()
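For readers who want to try the idea without installing readtext, here is a rough base R sketch of the same steps. It is not the approach above, only an approximation of it: the folder, the three file names, and the helper `read_one()` are hypothetical, and the result is a plain data frame rather than a tibble.

```r
# Create a throwaway folder with three tiny example files (hypothetical names)
folder <- file.path(tempdir(), "texts_demo")
dir.create(folder, showWarnings = FALSE)
example_names <- c("A_1_COD.txt", "B_2_COD.txt", "C_3_COD.txt")
for (f in example_names) writeLines("Lorem ipsum", file.path(folder, f))

# List the files; pattern is a regular expression, not a glob
paths <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: read one file into a single string
read_one <- function(p) paste(readLines(p, warn = FALSE), collapse = "\n")

# Build the four columns: row, filename, category, text
filename <- sub("\\.txt$", "", basename(paths))
dat_base <- data.frame(
  row      = seq_along(paths),
  filename = filename,
  category = factor(sub("_.*$", "", filename)),
  text     = vapply(paths, read_one, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
dat_base
```

Here the category is taken as everything before the first underscore, which goes beyond the single leading character that `str_extract(doc_id, "^.")` keeps.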
Update:
Maybe one of the following fits your needs. The first example keeps the filename column with the category at the front of each value:
folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # the space after "^" matches the space at the start of each name
  tidyr::extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
The second uses tidyr::extract to create two columns from doc_id, so the filename has the category part removed:
## Use tidyr::extract to create two new columns from doc_id
folder %>%
  # get full path names for each text
  dir(pattern = "\\.txt$",
      full.names = TRUE) %>%
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>%
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>%
  # the space after "^" matches the space at the start of each name
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^ ([A-Z]+)_(.*)\\.txt$") %>%
  mutate(category = factor(category)) %>%
  select(filename, category, text) %>%
  rowid_to_column(var = "row") %>%
  tibble()
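Hard-coding a space into the regex works when every name carries exactly one stray space. If that cannot be guaranteed, a possible alternative is to trim the names first and then use a pattern without the space. The sketch below uses base R and a hypothetical `doc_id` vector with mixed whitespace; inside the pipeline above, `stringr::str_trim()` should be usable for the same trimming step.

```r
# Hypothetical doc_id values: leading space, trailing space, and a clean name
doc_id <- c(" ENTITY_117_MOR.txt", "INCREMENTAL_208_MOR.txt ", "MODERATE_173_MOR.txt")

# Remove whitespace on both ends before any pattern matching
clean <- trimws(doc_id)

# Strip the extension, then capture the run of capitals before the first underscore
filename <- sub("\\.txt$", "", clean)
category <- factor(sub("^([A-Z]+)_.*$", "\\1", filename))

data.frame(filename, category)
```

With the names trimmed up front, the same regex covers names with and without stray spaces.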