How to extract the first paragraphs from text in a data frame?

Consider this data frame:

library(dplyr)
library(stringr)


mydf <- data_frame(text = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum',
                            'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum',
                            'this is a short text without paragraphs! HA!!!'))

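I would like to create a column first_paragraphs that contains only the first two paragraphs of the text stored in the text column. As you can see, sometimes the text has no paragraph break at all (row 3). In that case, the text should be kept as is.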

I tried the following, without success.

#this function finds the position of the second \n in the data
myend <- function(text){
 myend <- str_locate_all(text, "\n")[[2]] %>% as_tibble() %>% pull(end) 
 myend
}

mydf <-mydf %>% mutate(thresh = myend(text),
                       #here I only keep text until that threshold
                       first_paragraphs= str_sub(text, 1, thresh))

Error in mutate_impl(.data, dots) : 
  Evaluation error: subscript out of bounds.

What is the problem here?

The expected output is:

data_frame(text = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ',
                    'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ',
                    'this is a short text without paragraphs! HA!!!'))

Thanks a lot!

This stores the first two paragraphs in the variable "first_paragraphs", alongside the "thresh" variable:

# tibble() replaces the deprecated data_frame()
mydf <- tibble(text = paste0(
  'Lorem ipsum dolor sit amet, ',
  'consectetur adipiscing elit, ',
  'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ',
  '\nUt enim ad minim veniam, ',
  'quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ',
  '\nDuis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. ',
  'Excepteur sint occaecat cupidatat non proident, ',
  'sunt in culpa qui officia deserunt mollit anim id est laborum'))

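# [[1]] selects the matches for the single row; [2, 2] is the end of its second "\n"
mydf <- mydf %>% mutate(thresh = str_locate_all(text, "\n")[[1]][2, 2],
                        first_paragraphs = str_sub(text, 1, thresh - 1))  # stop before the "\n"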
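
To see why the indexing above is `[[1]][2, 2]`, it may help to look at what `str_locate_all` returns. The following is a minimal sketch on made-up strings (the vector `x` is an illustration, not the data from the question): the result is a list with one matrix per input string, so the first index selects a string and the second selects a match within it. If the column holds only one string, `[[2]]` asks for a second list element that does not exist, which is one plausible source of the "subscript out of bounds" error.

```r
library(stringr)

# Two input strings: the first has two line breaks, the second has none.
x <- c("a\nb\nc", "no breaks")
loc <- str_locate_all(x, "\n")

length(loc)      # one matrix per input string
loc[[1]]         # rows are matches, columns are start and end
loc[[1]][2, 2]   # end position of the second "\n" in the first string
nrow(loc[[2]])   # no match in the second string, so there is no row 2 to index
```

Note that a string with fewer than two line breaks has no second row in its matrix, so this indexing needs a guard before it can be applied to every row.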

Here is a base R solution using strsplit:

mydf$firstparagraph = paste(strsplit(mydf$text, "\n")[[1]][1:2], collapse = "\n")

Result:

> mydf$firstparagraph
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "

Edit:

With the OP's updated dataset, here is how to extract the first two paragraphs from each row of text:

mydf$firstparagraph = sapply(strsplit(mydf$text, "\n"), 
                             function(x) sub("\nNA$", "", paste(x[1:2], collapse = "\n")))

For better readability, you can use the pipe from dplyr:

library(dplyr)

mydf$text %>%
  strsplit("\n") %>%
  sapply(function(x){
    x[1:2] %>%
      paste(collapse = "\n") %>%
      sub("\nNA$", "", .)
  })

With the tidyverse:

library(dplyr)
library(stringr)
library(purrr)

mydf %>%
  # map_chr() returns a character vector; map() would create a list column
  mutate(firstparagraph = map_chr(strsplit(text, "\n"), ~{
    .x[1:2] %>% 
      paste(collapse = "\n") %>% 
      str_replace("\nNA$", "")
  }))

Result:

> mydf$firstparagraph
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
[2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \nUt enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
[3] "this is a short text without paragraphs! HA!!!" 

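sapply is needed because the text column now has multiple rows, so strsplit returns a list in which each element corresponds to one row of text. sub removes the extra \nNA that appears for rows with fewer than two paragraphs.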
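
As a possible variation on the approach above, the \nNA cleanup can be avoided by taking the paragraphs with head() instead of indexing with [1:2], since head() returns only the elements that exist. The sketch below is an assumption-laden illustration rather than part of the original answers: the helper name `first_n_paragraphs` and the sample vector `x` are hypothetical, and the text is assumed to contain no NA values.

```r
# Hypothetical helper (not from the answers above): keep at most the first
# n paragraphs of each string, where paragraphs are separated by "\n".
# Assumes `text` is a character vector without NA values.
first_n_paragraphs <- function(text, n = 2) {
  parts <- strsplit(text, "\n", fixed = TRUE)
  # head() returns fewer than n elements when fewer exist, so no NA padding
  vapply(parts, function(p) paste(head(p, n), collapse = "\n"), character(1))
}

x <- c("one \ntwo \nthree", "no breaks here")

# Indexing with [1:2] pads a short vector with NA ...
strsplit(x, "\n", fixed = TRUE)[[2]][1:2]

# ... while the head()-based helper leaves short texts untouched.
first_n_paragraphs(x)
```

With the data frame from the question, this could be used as `mydf$firstparagraph <- first_n_paragraphs(mydf$text)`. One reason to prefer it is that the sub("\nNA$", "", ...) cleanup would also strip a genuine second paragraph that consists of the literal text "NA".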