How do I parse out a specific section of text?

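My goal is to pull specific sections out of a set of Word documents based on keywords. I'm having trouble parsing those sections out of a larger data set of text files. The data set originally looked like this, with "title one" and "title two" marking the start and end of the text I'm interested in, and "unimportant words" standing for the part of each text file I'm not interested in: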

**Text**           **Text File** 
title one           Text file 1
sentence one        Text file 1
sentence two        Text file 1
title two           Text file 1
unimportant words   Text file 1
title one           Text file 2
sentence one        Text file 2

I then converted the data to character with as.character and tidied it with unnest_tokens:

library(dplyr)     # provides %>%
library(tidytext)  # provides unnest_tokens()
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")

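I now want to look only at the sentences in the data set and exclude the unimportant words. Title one and title two are the same in every text file, but the sentences between them differ. I tried the code below, but it doesn't seem to work.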

filtered_resume <- lapply(tidy_resume, (tidy_resume %>% select(Name) %>% filter(title:two)))

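I'm not familiar with the tidytext package, so here is an alternative base R solution. It uses this expanded example data (the code to create it is at the bottom):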

> df
                Text        File
1          title one Text file 1
2       sentence one Text file 1
3       sentence two Text file 1
4          title two Text file 1
5  unimportant words Text file 1
6          title one Text file 2
7       sentence one Text file 2
8       sentence two Text file 2
9     sentence three Text file 2
10         title two Text file 2
11 unimportant words Text file 2

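Create a function that, based on the values in the Text column, builds a separate column indicating whether a given row should be kept or dropped. Details are in the comments: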

get_important_sentences <- function(df_) {
  # Create some variables for filtering. val starts
  # at 0 so rows before the first title are dropped
  val = 0
  keep = c()

  # For every text row
  for (x in df_$Text) {
    # Multiply the current val by 2
    val = val * 2

    # If the current text includes "title",
    # reset val to 1 for 'title one', and to 0
    # for 'title two'
    if (grepl("title", x)) {
      val = ifelse(grepl("one", x), 1, 0)
    }

    # append val to keep each time
    keep = c(keep, val)
  }

  # keep is now a numeric vector- add it to
  # the data frame
  df_$keep = keep

  # exclude any rows where 'keep' is 1 (for
  # 'title one') or 0 (for 'title two' or any
  # unimportant words). Also, drop the helper
  # 'keep' column from the result
  return(df_[df_$keep > 1, c("Text", "File")])
}

You can then call it on the whole data frame:

> get_important_sentences(df)
            Text        File
2   sentence one Text file 1
3   sentence two Text file 1
7   sentence one Text file 2
8   sentence two Text file 2
9 sentence three Text file 2

Or apply it to each source file separately with lapply:

> lapply(split(df, df$File), get_important_sentences)
$`Text file 1`
          Text        File
2 sentence one Text file 1
3 sentence two Text file 1

$`Text file 2`
            Text        File
7   sentence one Text file 2
8   sentence two Text file 2
9 sentence three Text file 2

Data:

df <-
  data.frame(
    Text = c(
      "title one",
      "sentence one",
      "sentence two",
      "title two",
      "unimportant words",
      "title one",
      "sentence one",
      "sentence two",
      "sentence three",
      "title two",
      "unimportant words"
    ),
    File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
    stringsAsFactors = FALSE
  )
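The same keep/drop logic can also be written without an explicit loop. The following is only a sketch, not part of the original answer: `between_titles` is a hypothetical helper name, and it assumes that every section opens with a row containing "title one" and closes with a row containing "title two". A row is treated as inside a section when more start markers than end markers have been seen so far.

```r
# Same example data as above, repeated so the sketch runs on its own
df <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one", "sentence two",
           "sentence three", "title two", "unimportant words"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not from the original answer)
between_titles <- function(df_, start = "title one", end = "title two") {
  is_start <- grepl(start, df_$Text, fixed = TRUE)
  is_end   <- grepl(end, df_$Text, fixed = TRUE)

  # TRUE from each start marker up to (not including) the next end marker
  inside <- cumsum(is_start) > cumsum(is_end)

  # Keep rows inside a section, but not the start marker itself
  df_[inside & !is_start, c("Text", "File")]
}

result  <- between_titles(df)
by_file <- lapply(split(df, df$File), between_titles)
```

If a file has a "title one" with no matching "title two", everything after that title is kept, so it may be worth checking the marker counts first.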

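If you want a tidyverse option with very few lines of code, take a look at this. You can use case_when() and str_detect() to find the rows in the data frame that carry the important/not important signals.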

library(tidyverse)

df1 <- df %>%
  mutate(important = case_when(str_detect(Text, "title one") ~ TRUE,
                               str_detect(Text, "title two") ~ FALSE))
df1 
#> # A tibble: 11 x 3
#>    Text              File        important
#>    <chr>             <chr>       <lgl>    
#>  1 title one         Text file 1 TRUE     
#>  2 sentence one      Text file 1 NA       
#>  3 sentence two      Text file 1 NA       
#>  4 title two         Text file 1 FALSE    
#>  5 unimportant words Text file 1 NA       
#>  6 title one         Text file 2 TRUE     
#>  7 sentence one      Text file 2 NA       
#>  8 sentence two      Text file 2 NA       
#>  9 sentence three    Text file 2 NA       
#> 10 title two         Text file 2 FALSE    
#> 11 unimportant words Text file 2 NA

Now you can use fill() from tidyr to fill those values downward.

df1 %>%
  fill(important, .direction = "down")
#> # A tibble: 11 x 3
#>    Text              File        important
#>    <chr>             <chr>       <lgl>    
#>  1 title one         Text file 1 TRUE     
#>  2 sentence one      Text file 1 TRUE     
#>  3 sentence two      Text file 1 TRUE     
#>  4 title two         Text file 1 FALSE    
#>  5 unimportant words Text file 1 FALSE    
#>  6 title one         Text file 2 TRUE     
#>  7 sentence one      Text file 2 TRUE     
#>  8 sentence two      Text file 2 TRUE     
#>  9 sentence three    Text file 2 TRUE     
#> 10 title two         Text file 2 FALSE    
#> 11 unimportant words Text file 2 FALSE

Created on 2018-08-14 by the reprex package (v0.2.0).

At this point you can filter(important) to keep only the text you want, and then use functions from tidytext to do text mining on the important text you have left.
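As a minimal sketch of that last step, the whole pipeline might look like the following. Two things here are assumptions that go beyond the answer above: the group_by(File) call, which is there so filled values cannot carry over from one file into the next, and the extra str_detect() condition, which drops the "title one" rows themselves, since they are marked TRUE as well. The names `important_text` and `tidy_words` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# Same example data as in the answer above
df <- data.frame(
  Text = c("title one", "sentence one", "sentence two", "title two",
           "unimportant words", "title one", "sentence one", "sentence two",
           "sentence three", "title two", "unimportant words"),
  File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
  stringsAsFactors = FALSE
)

important_text <- df %>%
  mutate(important = case_when(str_detect(Text, "title one") ~ TRUE,
                               str_detect(Text, "title two") ~ FALSE)) %>%
  group_by(File) %>%                      # assumption: sections never span files
  fill(important, .direction = "down") %>%
  ungroup() %>%
  # filter() drops FALSE and NA rows; the second condition removes the title rows
  filter(important, !str_detect(Text, "title one")) %>%
  select(Text, File)

# One row per word, ready for tidytext-style analysis
tidy_words <- important_text %>%
  unnest_tokens(word, Text, token = "words")
```

Rows that come before the first title in a file stay NA after fill() and are dropped by filter(), which is usually what you want here.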