Splitting and grouping plain text (grouping text by chapter in dataframe)?

Question

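I have a data frame/tibble into which I imported a plain text (txt) file. The text is very consistent and is grouped by chapter. Sometimes the chapter text is only one row, sometimes it is multiple rows. The data is in a single column, like this: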

# A tibble: 10,708 x 1
   x                                                                     
   <chr>                                                                                                                                   
 1 "Chapter 1 "                                                          
 2 "Chapter text. "     
 3 "Chapter 2 "                                                          
 4 "Chapter text. "    
 5 "Chapter 3 "
 6 "Chapter text. "
 7 "Chapter text. "
 8 "Chapter 4 "

I am trying to clean the data so that there is a new column for the chapter, with the text of each chapter in another column, like this:

# A tibble: 10,548 x 2
   x                                Chapter   
   <chr>                             <chr>
 1 "Chapter text. "               "Chapter 1 "
 2 "Chapter text. "               "Chapter 2 "
 3 "Chapter text. "               "Chapter 3 " 
 4 "Chapter text. "               "Chapter 4 "

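I have been trying to use a regular expression to split and group the data at each occurrence of 'Chapter #' (the word Chapter followed by a number), but I cannot get the result I want. Any advice is much appreciated.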

Answer 1

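Based on "Sometimes the chapter text is only one row, sometimes it's multiple row", I assume that the text in rows 6 and 7 belongs to Chapter 3 and that your test data contains no text for Chapter 4 (your desired output may be slightly wrong).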

Here is an approach using dplyr and tidyr. Run it one step at a time and you will see how the data is transformed.

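library(dplyr)
library(tidyr)

df %>% 
  mutate(
    # heading lines end in a digit plus one trailing character (here a space);
    # cumsum() turns those matches into a running chapter id
    id = cumsum(grepl("[0-9].$", x)),
    # mark each heading with ":" so it can be split from its text later
    x = ifelse(grepl("[0-9].$", x), paste0(x, ":"), x)
  ) %>% 
  group_by(id) %>% 
  # glue the heading and all of its text lines into one string per chapter
  summarize(
    chapter = paste0(x, collapse = "")
  ) %>% 
  # split at the first ":" only; extra = "merge" keeps later colons in the text
  separate(chapter, into = c("chapter", "text"), sep = ":", extra = "merge")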

# A tibble: 4 x 3
     id chapter      text                          
  <int> <chr>        <chr>                         
1     1 "Chapter 1 " "Chapter text. "              
2     2 "Chapter 2 " "Chapter text. "              
3     3 "Chapter 3 " "Chapter text. Chapter text. "
4     4 "Chapter 4 " ""
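
The grouping step is the part that is easiest to miss, so here is a minimal base R sketch of it on its own. It assumes the same eight sample lines as in the question; the names `is_heading` and `id` are only for illustration.

```r
# Same sample lines as in the question.
x <- c("Chapter 1 ", "Chapter text. ", "Chapter 2 ", "Chapter text. ",
       "Chapter 3 ", "Chapter text. ", "Chapter text. ", "Chapter 4 ")

# TRUE where a line ends in a digit followed by exactly one more character
# (the trailing space in this data).
is_heading <- grepl("[0-9].$", x)

# Each TRUE bumps the running total, so every line inherits the number
# of the most recent heading above it.
id <- cumsum(is_heading)

print(data.frame(x, is_heading, id))
```

Note that the pattern relies on the trailing space after the chapter number. A text line that happens to end in a digit plus one character would also be counted as a heading, so a stricter pattern may be safer on the full file.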

Data:

df <- structure(list(x = c("Chapter 1 ", "Chapter text. ", "Chapter 2 ", 
"Chapter text. ", "Chapter 3 ", "Chapter text. ", "Chapter text. ", 
"Chapter 4 ")), .Names = "x", class = "data.frame", row.names = c(NA, 
-8L))
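
If you would rather keep one row per line of text, which is what the row counts in the question seem to suggest, a possible variation is to copy each heading into a new column, fill it downward, and then drop the heading rows. This is only a sketch: `heading_pattern`, `is_heading` and `by_line` are names made up for the example, and the pattern assumes that a heading line contains nothing but "Chapter", a number, and optional trailing whitespace.

```r
library(dplyr)
library(tidyr)

# Same sample data as above, rebuilt so the example is self-contained.
df <- data.frame(
  x = c("Chapter 1 ", "Chapter text. ", "Chapter 2 ", "Chapter text. ",
        "Chapter 3 ", "Chapter text. ", "Chapter text. ", "Chapter 4 "),
  stringsAsFactors = FALSE
)

# Assumption: a heading is exactly "Chapter <number>" plus optional whitespace.
heading_pattern <- "^Chapter [0-9]+[[:space:]]*$"

by_line <- df %>%
  mutate(
    is_heading = grepl(heading_pattern, x),
    # keep the heading text on heading rows, NA everywhere else
    Chapter = ifelse(is_heading, x, NA_character_)
  ) %>%
  fill(Chapter) %>%        # carry each heading down to the lines below it
  filter(!is_heading) %>%  # drop the heading rows themselves
  select(x, Chapter)

print(by_line)
```

With the sample data, Chapter 4 has no text lines, so it does not appear in the result at all. Both text lines of Chapter 3 are kept as separate rows.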

nlp

r

text-mining

tidytext