Extracting in-text citations (character strings) from text in R

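I am trying to write a function that lets me paste in written text and returns a list of the in-text citations used in the writing. For example, this is what I currently have: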

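pull_cites <- function(text) {
  # Grab everything between parentheses, then strip the parentheses themselves.
  # Backslashes must be doubled inside R string literals.
  gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
}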
    
pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

In this example, it returns

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would like it to return something like:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!

With some not-so-difficult regex, we can do the following:

library(tidyverse)

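pull_cites <- function(text) {
  # First alternative: "(Author ... year)"; second alternative: "Author (year"
  str_extract_all(text, "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))|[A-Z][a-z]+ \\([12][0-9]{3}[^()]*", simplify = TRUE) %>%
    gsub("\\(", "", .) %>%   # drop the "(" left over from "Author (year"
    str_split(., "; ") %>%   # split multiple citations into separate items
    unlist()
}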

pull_cites("This is a test. I only want to select the (cites) in parenthesis. 
            I do not want it to return words in parenthesis that do not have years attached, 
            such as abbreviations (abbr). For example, citing (Smith 2010) is something I would 
            want to be returned. I would also want multiple citations returned separately such 
            as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned 
            as Cooper 2015, and not just 2015. other aspects of life 
            history (Nye et al. 2010; Runge et al. 2010; Lesser 2016). In the Gulf of Maine, 
            annual sea surface temperature (SST) averages have increased a total of roughly 1.6 °C 
            since 1895 (Fernandez et al. 2020)")

[1] "Smith 2010"            "Smith 2010"           
[3] "Jones 2001"            "Brown 2020"           
[5] "Cooper 2015"           "Nye et al. 2010"      
[7] "Runge et al. 2010"     "Lesser 2016"          
[9] "Fernandez et al. 2020"

Explanation of the regex in str_extract_all():

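  • (?<=\() is a lookbehind: the match must be immediately preceded by an opening parenthesis ( (in an R string the backslash is doubled, \\()
  • [A-Z][a-z][^()]* matches an uppercase letter, followed by a lowercase letter, followed by zero or more characters that are not parentheses ([^()]* courtesy of @WiktorStribiżew)
  • (?=\)) is a lookahead: the match must be immediately followed by a closing parenthesis )
  • [12][0-9]{3} matches the year; I assume a year starts with 1 or 2 and is followed by three more digits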
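To see the lookarounds in isolation, here is a minimal sketch of the first alternative alone. The toy string `txt` is made up for illustration, and the sketch uses base R's `gregexpr()` with `perl = TRUE` instead of stringr, only to keep it dependency-free:

```r
# Hypothetical toy input, not taken from the question
txt <- "See (abbr), (Smith 2010) and (SST)."

# First alternative only; perl = TRUE enables lookbehind/lookahead in base R
pat <- "(?<=\\()[A-Z][a-z][^()]* [12][0-9]{3}(?=\\))"

hits <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
print(hits)
```

Because the parentheses are only asserted by the lookarounds and never consumed, the match comes back without them. "(abbr)" is rejected because it does not start with an uppercase letter, and "(SST)" because the second character is not lowercase.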

The following regex handles the special case of the pattern Cooper (2015):

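  • [A-Z][a-z]+ \([12][0-9]{3}[^()]* matches an uppercase letter followed by one or more lowercase letters, then a space, then an opening parenthesis (, then a year as defined above, then any further characters that are not parentheses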
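As a small sketch of how this second alternative behaves by itself, the snippet below runs it on a made-up sentence (`txt` is hypothetical) and then removes the leftover opening parenthesis, which is the same cleanup the `gsub()` step performs in the function above:

```r
# Hypothetical toy input, not taken from the question
txt <- "As Cooper (2015) showed, SST rose."

# Second alternative only: "Author (year" up to, but not including, ")"
pat <- "[A-Z][a-z]+ \\([12][0-9]{3}[^()]*"

raw   <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
clean <- gsub("\\(", "", raw)  # drop the "(" that the pattern had to consume
print(raw)
print(clean)
```

Unlike the lookbehind version, this pattern has to consume the "(" because it sits in the middle of the match, so a cleanup step is needed afterwards.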

You can also use

x <- "This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is something I would want to be returned. I would also want multiple citations returned separately such as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015."
rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
library(stringr)
res <- str_match_all(x, rx)
result <- lapply(res, function(z) {
  # z[, 2] is Group 1 (name before the parenthesis), z[, 3] is Group 2 (parenthesis content)
  ifelse(!is.na(z[, 2]) & str_detect(z[, 3], "^\\d+$"),
         paste(trimws(z[, 2]), trimws(z[, 3])),
         z[, 3])
})
unlist(sapply(result, function(z) strsplit(paste(z, collapse = ";"), "\\s*;\\s*")))
## -> [1] "Smith 2010"  "Smith 2010"  "Jones 2001"  "Brown 2020"  "Cooper 2015"

See the R demo and the regex demo.

The regex matches

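  • (?:\b(\p{Lu}\w*(?:\s+\p{Lu}\w*)*))? - an optional sequence of
    • \b - a word boundary
    • (\p{Lu}\w*(?:\s+\p{Lu}\w*)*) - Group 1: an uppercase letter followed by zero or more word characters, then zero or more sequences of one or more whitespace characters followed by an uppercase letter and zero or more word characters
  • \s* - zero or more whitespace characters
  • \( - a ( character
  • ([^()]*\d{4}) - Group 2: zero or more characters other than ( and ), then four digits
  • \) - a ) character.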

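The str_match_all(x, rx) function finds all matches and keeps the captured substrings. In the resulting matrix, column 2 holds Group 1 and column 3 holds Group 2. If Group 1 is not NA and Group 2 consists only of digits, the Group 1 and Group 2 values are concatenated; otherwise, Group 2 is used as is. Finally, the items in result are joined with a ; character and split again on ; surrounded by zero or more whitespace characters.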
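The role of the two capture groups is easier to see on a tiny input. The following is a minimal sketch, assuming stringr is installed; the sentence `toy` and the variable names `m` and `cites` are made up for illustration, and `str_split()` is used for the final split instead of `strsplit()`:

```r
library(stringr)

rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"

# Hypothetical toy input, not taken from the question
toy <- "Cooper (2015) agrees with earlier work (Smith 2010; Jones 2001)."

# One row per match: column 1 = full match, column 2 = Group 1, column 3 = Group 2
m <- str_match_all(toy, rx)[[1]]
print(m)

# Prepend the author only when the parenthesis holds nothing but a year
cites <- ifelse(!is.na(m[, 2]) & str_detect(m[, 3], "^\\d+$"),
                paste(m[, 2], m[, 3]),
                m[, 3])

# Split multi-citation parentheses on ";" with optional surrounding whitespace
out <- unlist(str_split(cites, "\\s*;\\s*"))
print(out)
```

In the first row Group 1 captures "Cooper" and Group 2 is the bare year, so the two are pasted together. In the second row nothing capitalized directly precedes the parenthesis, so Group 1 is NA and the parenthesis content is kept and split on the semicolon.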