R: How to get the file name with quanteda: char_segment

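I am using char_segment from the quanteda library to separate multiple documents that are stored in one file and divided by a pattern. The command works great and is easy to use! (I did try str_match and strsplit, but without success.)

Unfortunately, I cannot get the file name as a variable, which is key for the next step of the analysis.

Example of my commands: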

library(quanteda)
library(readtext)  # readtext() comes from the readtext package
doc <- readtext("PATH/*.docx")
View(doc)

docc <- char_segment(doc$text, pattern = ",", remove_pattern = TRUE)

Any suggestions or other options for splitting the documents are welcome.

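Just get the list of your docx files first; that gives you the file names. Then run char_segment on them via lapply, a for loop, or purrr::map().

The following code assumes that your target documents are stored in a directory named "docx" inside your working directory.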

library(quanteda)
library(readtext)  ## Remember to include in your posts the libraries required to replicate the code.


list_of_docx <- list.files(path = "./docx", ## Looks inside the ./docx directory
                       full.names = TRUE,   ## retrieves the full paths to the documents
                       pattern = "[.]docx$", ## retrieves all documents whose names end in ".docx"
                       ignore.case = TRUE)  ## ignores the letter case of the documents' names
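
Because full.names = TRUE returns paths rather than bare file names, it may help to strip the directory part before storing the name as a variable. A minimal sketch in base R, using two made-up paths in place of the real contents of ./docx:

```r
# Hypothetical paths, standing in for what list.files(full.names = TRUE) returns
paths <- c("./docx/report_A.docx", "./docx/notes_B.DOCX")

# File names with their extension
file_names <- basename(paths)

# File names without the extension (tools ships with base R)
file_stems <- tools::file_path_sans_ext(file_names)

print(file_names)
print(file_stems)
```

Either vector can be used in place of the full path when labelling the segments.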

Prepare the for loop:

df_docx <- data.frame() ## Create an empty data frame to store your data

for (d in seq_along(list_of_docx)) {  ## Iterate over the indices of the list of document paths
    temp_object <- readtext(list_of_docx[d])
    temp_segmented_object <- char_segment(temp_object$text, pattern = ",", remove_pattern = TRUE)
    temp_df <- as.data.frame(temp_segmented_object)
    colnames(temp_df) <- "segments"
    temp_df$title <- as.character(list_of_docx[d])  ## Create a variable with the path of the source document; wrap it in basename() to keep only the file name
    temp_df <- temp_df[, c("title", "segments")]
    df_docx <- rbind(df_docx, temp_df) ## Append each data frame to the previously created empty data frame
    rm(temp_df, temp_object)  ## Remove the temporary objects before the next iteration
}


head(df_docx)
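
The same idea can be sketched with lapply() instead of a for loop. To keep the example self-contained, it works on a small named character vector (the names `texts`, `segment_one`, and `df_segments` are made up for illustration) rather than on files read with readtext(); valuetype = "fixed" is set explicitly so that the comma is treated as a literal string.

```r
library(quanteda)

# Hypothetical in-memory texts, named by the file they would have come from
texts <- c("a.docx" = "alpha, beta, gamma",
           "b.docx" = "delta, epsilon")

# Segment one text and label every segment with its source file name
segment_one <- function(file_name, text) {
  segs <- unname(char_segment(text, pattern = ",", valuetype = "fixed",
                              remove_pattern = TRUE))
  data.frame(title = rep(file_name, length(segs)),
             segments = segs,
             stringsAsFactors = FALSE)
}

# One data frame per document, then stack them
pieces <- lapply(seq_along(texts),
                 function(i) segment_one(names(texts)[i], texts[[i]]))
df_segments <- do.call(rbind, pieces)

head(df_segments)
```

With real files, the text and name for each element would come from readtext(list_of_docx[i]) and list_of_docx[i] respectively.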

You should already have the names of the Word files:

require(readtext)
data_dir <- system.file("extdata/", package = "readtext")
readtext(paste0(data_dir, "/word/*"))

readtext object consisting of 6 documents and 0 docvars.    
# data.frame [6 × 2]
  doc_id                                 text                
  <chr>                                  <chr>               
1 21Parti_Socialiste_SUMMARY_2004.doc    "\"[pic]\nRésu\"..."
2 21vivant2004.doc                       "\"http://www\"..." 
3 21VLD2004.doc                          "\"http://www\"..." 
4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..." 
5 UK_2015_EccentricParty.docx            "\"The Eccent\"..." 
6 UK_2015_LoonyParty.docx                "\"The Offici\"..."

They are passed on as the document names to downstream quanteda objects.
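
As a rough illustration of how the names travel downstream, the sketch below builds a corpus from a named character vector (a made-up stand-in for a readtext object, whose doc_id would play the same role), segments it with corpus_segment(), and recovers the source file of each segment with docid(). It assumes quanteda 3.0 or later, where docid() and as.character() on a corpus are available.

```r
library(quanteda)

# Hypothetical texts named by source file; corpus(readtext(...)) would take
# the document names from doc_id in the same way
texts <- c("a.docx" = "alpha, beta, gamma",
           "b.docx" = "delta, epsilon")

corp <- corpus(texts)

# Split each document on a literal comma; the segments remember their origin
segs <- corpus_segment(corp, pattern = ",", valuetype = "fixed",
                       extract_pattern = TRUE)

# docid() gives the original document name for every segment
result <- data.frame(file = as.character(docid(segs)),
                     segment = unname(as.character(segs)),
                     stringsAsFactors = FALSE)

head(result)
```

This avoids the explicit loop, since the file name stays attached to each segment as part of the corpus.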

This is my problem when I split the documents with ###*:

[Screenshot: example when I read the text]

[Screenshot: when I use char_segment]