Simple section labeling with tidytext for plain text input

Question

I am using tidytext (and the tidyverse) to analyze some text data (as in Tidy Text Mining with R).

My input text file, myfile.txt, looks like this:

# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>

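There are 60 or so sections.

I would like to generate a column section_name with the strings "Category 1 Name" or "Category 2 Name" as the values for the corresponding lines. For example, I have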

library(tidyverse)
library(tidytext)
library(stringr)

fname <- "myfile.txt"
all_text <- readLines(fname)
all_lines <- tibble(text = all_text)
tidiedtext <- all_lines %>%
  mutate(linenumber = row_number(),
         section_id = cumsum(str_detect(text, regex("^#", ignore_case = TRUE)))) %>%
  filter(!str_detect(text, regex("^#"))) %>%
  ungroup()

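This adds a column to tidiedtext holding the corresponding section number for each line.

Is it possible to add a line to the call to mutate() that adds such a column? Or should I use a different approach?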

Answer 1

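For simplicity, this uses grepl and if_else together with tidyr::fill, but there is nothing wrong with the original approach; it is very similar to what the tidytext book uses. Also note that filtering after adding line numbers leaves gaps in the line numbers. If that matters, add the line numbers after filter.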

library(tidyverse)

text <- '# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>'

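# tibble() replaces the deprecated data_frame()
all_lines <- tibble(text = read_lines(text))

tidied <- all_lines %>%
  mutate(line = row_number(),
         # Copy header lines into `section`, leave NA everywhere else
         section = if_else(grepl('^#', text), text, NA_character_)) %>%
  # Carry each header down to the lines that follow it
  fill(section) %>%
  # Drop the header lines themselves
  filter(!grepl('^#', text))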

tidied
#> # A tibble: 3 × 3
#>                          text  line          section
#>                         <chr> <int>            <chr>
#> 1           Lorem ipsum dolor     2 # Section 1 Name
#> 2    sit amet ... (et cetera)     3 # Section 1 Name
#> 3 <multiple lines here again>     5 # Section 2 Name
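If consecutive line numbers matter, one option is to number the rows only after the header lines have been dropped. This is a minimal sketch that uses the same sample text as above. The object name tidied_renumbered is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same sample input as above, one element per line
text <- c("# Section 1 Name",
          "Lorem ipsum dolor",
          "sit amet ... (et cetera)",
          "# Section 2 Name",
          "<multiple lines here again>")

tidied_renumbered <- tibble(text = text) %>%
  # Copy header lines into `section`, leave NA everywhere else
  mutate(section = if_else(grepl("^#", text), text, NA_character_)) %>%
  # Carry each header down to the lines that follow it
  fill(section) %>%
  # Drop the header lines first ...
  filter(!grepl("^#", text)) %>%
  # ... then number what is left, so the numbering has no gaps
  mutate(line = row_number())

tidied_renumbered
```

The trade-off is that line no longer points back to a position in the original file, so keep the earlier ordering if you need to trace rows to the source.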

Alternatively, if you just want to format the numbers you already have, simply add section_name = paste('Category', section_id, 'Name') to your mutate call.
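As a rough sketch of that second option, here is the pipeline from the question with the extra line in mutate. The sample text is defined inline instead of being read from myfile.txt, and the object name labelled is hypothetical.

```r
library(dplyr)
library(stringr)

# Inline stand-in for the contents of myfile.txt
text <- c("# Section 1 Name",
          "Lorem ipsum dolor",
          "sit amet ... (et cetera)",
          "# Section 2 Name",
          "<multiple lines here again>")

labelled <- tibble(text = text) %>%
  mutate(linenumber = row_number(),
         # Running count of header lines gives each line its section number
         section_id = cumsum(str_detect(text, "^#")),
         # Build the label from the section number
         section_name = paste("Category", section_id, "Name")) %>%
  # Drop the header lines themselves
  filter(!str_detect(text, "^#"))

labelled
```

This works because a later argument to mutate can refer to a column created by an earlier argument in the same call, so section_id is already available when section_name is built.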

Answer 2

I don't want you to rewrite your whole script, but I found the question interesting and wanted to add a base R attempt:

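parse_data <- function(file_name) {
  all_rows <- readLines(file_name)
  # Header lines start with "#"; anchoring the pattern ignores a "#" inside body text
  indices <- grep('^#', all_rows)
  # Repeat each header index once per line in its section (header included)
  splitter <- rep(indices, diff(c(indices, length(all_rows) + 1)))
  lst <- split(all_rows, splitter)
  lst <- lapply(lst, function(x) {
    # The first element is the header, the rest are the section's lines
    data.frame(text = x[-1], section = x[1], stringsAsFactors = FALSE)
  })
  # Original line numbers of the non-header lines
  line_nums <- seq_along(all_rows)[-indices]
  df <- do.call(rbind.data.frame, lst)
  cbind.data.frame(df, linenumber = line_nums)
}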

Testing with a file named ipsum_data.txt:

parse_data('ipsum_data.txt')

This yields:

 text                        section          linenumber
 Lorem ipsum dolor           # Section 1 Name 2         
 sit amet ... (et cetera)    # Section 1 Name 3         
 <multiple lines here again> # Section 2 Name 5

The file ipsum_data.txt contains:

# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>
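For comparison, here is a shorter base R route to the same three columns. It is not the function above but a different sketch built on cumsum, and it assumes the file starts with a header line. The temporary file and the names path, is_header and result are made up for illustration.

```r
# Write the sample file shown above to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("# Section 1 Name",
             "Lorem ipsum dolor",
             "sit amet ... (et cetera)",
             "# Section 2 Name",
             "<multiple lines here again>"), path)

all_rows <- readLines(path)
is_header <- grepl("^#", all_rows)

# cumsum() over the header flags numbers the sections; indexing the header
# lines by that number repeats each header for every line in its section
section <- all_rows[is_header][cumsum(is_header)]

result <- data.frame(
  text = all_rows[!is_header],
  section = section[!is_header],
  linenumber = which(!is_header),
  stringsAsFactors = FALSE
)

result
```

If the file had text before its first header, the section index for those lines would be 0 and the vector lengths would no longer match, so that case would need separate handling.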

Hope this helps.
