Extract a sample of words around a particular word using stringr in R

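I have seen a few similar questions on SO about this topic, but they are either poorly worded (example) or in a different language (example).

In my scenario, I consider anything surrounded by white space to be a word. Emojis, numbers, strings of letters that aren't real words: I don't care. I just want some context around the string that was found, so that I don't have to read the whole file to decide whether it is a valid match.

I tried the following, but it takes a while to run if the text file is long: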

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

I assume there are much, much faster or more efficient ways to do this, right?

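I would use unlist(strsplit) and then index the resulting vector. You can wrap this in a function so that the number of words to return before (pre) and after (post) the match is a flexible parameter: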

getContext <- function(text, look_for, pre = 3, post = pre) {
  # create vector of words (anything separated by a whitespace character)
  t_vec <- unlist(strsplit(text, '\\s'))

  # find position of matches
  matches <- which(t_vec == look_for)

  # return words before & after if any matches
  if (length(matches) > 0) {
    out <-
      list(before = sapply(matches, function(m) {
             # NA when there are fewer than `pre` words before the match
             if (m - pre < 1) rep(NA_character_, pre) else t_vec[(m - pre):(m - 1)]
           }),
           # indexing past the end of t_vec yields NA automatically
           after = sapply(matches, function(m) t_vec[(m + 1):(m + post)]))

    return(out)
  } else {
    warning('No matches')
  }
}

getContext(text, 'Verulam')

# $before
#      [,1]     
# [1,] "and"    
# [2,] "created"
# [3,] "Baron"  
# $after
#      [,1]     
# [1,] "in"     
# [2,] "1618[4]"
# [3,] "and"   


getContext(text, 'he')

# $before
#      [,1]     [,2]           [,3]          [,4]     
# [1,] "After"  "nature."      "in"          "John"   
# [2,] "his"    "Most"         "1621;[3][b]" "Aubrey" 
# [3,] "death," "importantly," "as"          "stating"
# $after
#      [,1]          [,2]     [,3]      [,4]        
# [1,] "remained"    "argued" "died"    "contracted"
# [2,] "extremely"   "this"   "without" "the"       
# [3,] "influential" "could"  "heirs,"  "condition" 

getContext(text, 'fruitloops')
# Warning message:
#   In getContext(text, "fruitloops") : No matches
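If one readable snippet per match is more convenient than separate before/after matrices, the same split-and-index idea can be sketched as below. This is only a minimal sketch: `kwic` and `sample_text` are hypothetical names introduced for illustration, and the window is clamped at the edges of the text instead of returning NA.

```r
# Hypothetical helper: one context string per match, clamped at the text edges.
kwic <- function(text, look_for, pre = 3, post = pre) {
  # split on runs of whitespace
  words <- strsplit(text, "\\s+")[[1]]
  hits <- which(words == look_for)

  # build one snippet per hit; returns character(0) when there are no hits
  vapply(hits, function(m) {
    from <- max(1, m - pre)
    to <- min(length(words), m + post)
    paste(words[from:to], collapse = " ")
  }, character(1))
}

sample_text <- "Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621"

kwic(sample_text, "Verulam")
# [1] "and created Baron Verulam in 1618[4] and"

kwic(sample_text, "Bacon", pre = 2, post = 2)
# [1] "Bacon was knighted"
```

Returning an empty character vector for "no matches" (rather than a warning) is a design choice of this sketch; it makes the result easy to test with length().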

If you don't mind storing your data in triplicate, you can build a data.frame, which is usually the most convenient structure to work with in R.

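context <- function(text){
  # split on single spaces (fixed = TRUE treats ' ' literally, not as a regex)
  splittedText <- strsplit(text, ' ', fixed = TRUE)[[1]]

  # one row per word, with its immediate neighbours alongside
  data.frame(
    words  = splittedText,
    before = head(c('', splittedText), -1),
    after  = tail(c(splittedText, ''), -1),
    stringsAsFactors = FALSE
  )
}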

info <- context(text)

print(subset(info, words == 'Verulam'))

print(subset(info, before == 'Lord'))

print(subset(info, grepl('[[:digit:]]', words)))

#       words before after
# 161 Verulam  Baron    in
#        words before after
# 9 Chancellor   Lord    of
#             words before after
# 43  empiricism.[6]     of   His
# 157           1603     in   and
# 163        1618[4]     in   and
# 169    1621;[3][b]     in    as
# 187          1626,     in  with
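The data.frame above only keeps one neighbour on each side. If more context is needed, the same idea might be extended along these lines. This is a rough sketch, not part of the answer above: `shift` and `wide_context` are hypothetical helpers, and the column names `before1`, `after1`, and so on are an assumption made for the example.

```r
# Hypothetical helper: shift a character vector by k positions, padding with ''.
# k > 0 looks back k words, k < 0 looks ahead |k| words.
shift <- function(x, k) {
  n <- length(x)
  pad <- rep("", min(abs(k), n))
  if (k > 0) {
    c(pad, head(x, n - length(pad)))
  } else if (k < 0) {
    c(tail(x, n - length(pad)), pad)
  } else {
    x
  }
}

# Hypothetical helper: one row per word plus n neighbours on each side.
wide_context <- function(text, n = 2) {
  w <- strsplit(text, " ", fixed = TRUE)[[1]]
  out <- data.frame(words = w, stringsAsFactors = FALSE)
  for (k in seq_len(n)) {
    out[[paste0("before", k)]] <- shift(w, k)
    out[[paste0("after", k)]]  <- shift(w, -k)
  }
  out
}

info2 <- wide_context("created Baron Verulam in 1618[4] and Viscount", n = 2)
print(subset(info2, words == "Verulam"))
```

Note that this stores 2n + 1 copies of the text, so it trades memory for the convenience of subset().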


stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

Change the numbers in {} to suit your needs.

You can also use non-capturing (?:) groups, though I'm not sure whether that improves speed.

stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
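One way to check whether the non-capturing version is actually faster is to time both patterns on a longer input. The snippet below is a rough sketch that assumes stringr is installed; `long_text` is a made-up input built by repeating a filler sentence, and the timings will vary with machine and data, so no numbers are shown.

```r
library(stringr)  # assumption: stringr is installed

# Made-up long input: filler sentence repeated, target phrase at the end
long_text <- paste(
  c(rep("the quick brown fox jumps over the lazy dog", 2000),
    "and created Baron Verulam in 1618[4] and"),
  collapse = " "
)

capturing     <- "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"
non_capturing <- "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"

# Both patterns should extract the same string
res_cap <- str_extract(long_text, capturing)
res_non <- str_extract(long_text, non_capturing)
stopifnot(identical(res_cap, res_non))

# Time repeated runs of each pattern; compare the elapsed values
system.time(for (i in 1:20) str_extract(long_text, capturing))
system.time(for (i in 1:20) str_extract(long_text, non_capturing))
```

For more reliable numbers, a dedicated benchmarking package would be a better tool than system.time(), but the loop above is enough for a first impression.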