How to extract keywords below and above a text from an article

Question

I have this character vector of lines from a journal article:

test_1 <- c("                                                                  Journal of Neonatal Nursing 27 (2021) 106–110", 
"                                                                     Contents lists available at ScienceDirect", 
"                                                               Journal of Neonatal Nursing", 
"                                                              journal homepage: www.elsevier.com/locate/jnn", 
"Comparison of inter-facility transports of critically ill neonates who died", 
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b", 
"a", "  Children’s Hospital of Michigan, Detroit, MI, USA", 
"b", "  Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA", 
"A R T I C L E I N F O                                       A B S T R A C T", 
"Keywords:                                                   Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the", 
"Inter-facility transport                                    Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants", 
"Neonatal intensive care                                     who died within 7 days of admission to a level IV NICU versus matched survivors.", 
"Mortality", "                                                            Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls", 
"                                                            matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de", 
"                                                            rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product", 
"                                                            transfusion.", 
"                                                            Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar", 
"                                                            scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was", 
"                                                            independently associated with male gender and unplanned events; not with patient group.", 
"                                                            Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a", 
"                                                            transport team are comparable in cases and controls.", 
"                                                                                              outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is", 
"1. Introduction                                                                               an assessment measure of infant status before and after transport (Lee"
)

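I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care and Mortality. I tried to get the line that contains "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. I want to get all the keywords below this line and above 1. Introduction.

Which regex or stringr functions can do this?

Thanks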

Answer 1

If I understand correctly, you are scanning a PDF downloaded from here. I think you should find a better way to scan your PDF.

Until then, the best option is probably:

library(stringr)

# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) +1

# get the line before ^1. Introduction (the dot is escaped so it matches a literal ".")
end <- which(str_detect(test_1, "^1\\. Introduction")) -1

# get the lines in between
x <- test_1[start:end]

# Extract keywords
x <- str_trim(str_sub(x, 1, 60))
x <- x[x!=""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care"  "Mortality"
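The cut-off at column 60 above is specific to this PDF layout. As a rough sketch only, the same idea can be wrapped in a function that derives the column width from the position of the "A B S T R A C T" header instead of hard-coding it. The names `two_col`, `extract_left_column` and `right_header` are hypothetical, the sample lines are shortened stand-ins for `test_1`, and base R is used in place of stringr so that the block runs without extra packages:

```r
# Hypothetical helper: builds a two-column line whose left column is 60 characters wide
two_col <- function(left, right = "") sprintf("%-60s%s", left, right)

# Shortened stand-ins for the lines in the question
lines <- c(
  two_col("A R T I C L E I N F O", "A B S T R A C T"),
  two_col("Keywords:", "Objective: To compare characteristics ..."),
  two_col("Inter-facility transport", "Transport Risk Index ..."),
  two_col("Neonatal intensive care", "who died within 7 days ..."),
  "Mortality",
  two_col("", "Study design: This retrospective ..."),
  "1. Introduction"
)

extract_left_column <- function(lines,
                                start_pattern = "^Keywords:",
                                end_pattern = "^1\\. Introduction",
                                right_header = "A B S T R A C T") {
  start <- grep(start_pattern, lines)[1] + 1
  end <- grep(end_pattern, lines)[1] - 1
  header_pos <- regexpr(right_header, lines, fixed = TRUE)
  if (is.na(start) || is.na(end) || !any(header_pos > 0)) {
    stop("Could not locate the keyword block or the right-hand column header")
  }
  # Everything before the right-hand header belongs to the left column
  width <- min(header_pos[header_pos > 0]) - 1
  left <- trimws(substr(lines[start:end], 1, width))
  left[left != ""]
}

keywords <- extract_left_column(lines)
print(keywords)
```

This still assumes that the text is laid out in fixed-width columns and that no keyword is longer than the left column.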

Answer 2

Edit:

You can define a function that finds the index of the line where Keywords appears and returns the indices of the lines below it:

find_keywords <- function(pattern, text) {
  index <- which(grepl(pattern, text))  
  sort(c(index + 1, index + 2, index + 3)) # If you suspect there are more than three keywords, then just `index + ...`
}

Based on that function, the keywords can be extracted:

library(stringr)
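# "^\\S+" matches only up to the first whitespace, so multi-word keywords are cut to their first word
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
#> [1] "Inter-facility" "Neonatal"       "Mortality"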
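If the full multi-word keywords are wanted rather than just the first word of each, one possible variation is sketched below. It is not part of the answer above: `find_keywords_n` and its `n` argument are hypothetical names, the sample lines are shortened stand-ins for `test_1`, and base R's `regmatches()` is used so that the block has no dependencies (`str_extract()` with the same pattern should behave the same way). The pattern assumes that the words inside a keyword are separated by single spaces and that the gap to the abstract column is at least two spaces wide:

```r
# Hypothetical variant of find_keywords(): `n` is the number of lines to take after each match
find_keywords_n <- function(pattern, text, n = 3) {
  index <- grep(pattern, text)
  candidates <- sort(unique(as.vector(outer(index, seq_len(n), "+"))))
  candidates[candidates <= length(text)]  # drop indices past the end of the vector
}

# Shortened stand-ins for the lines in the question
sample_lines <- c(
  "Keywords:                      Objective: To compare ...",
  "Inter-facility transport       Transport Risk Index ...",
  "Neonatal intensive care        who died within 7 days ...",
  "Mortality",
  "                               Study design: ..."
)

picked <- sample_lines[find_keywords_n("^Keywords:", sample_lines, n = 3)]

# One or more words separated by single spaces, anchored at the start of the line;
# the match stops at the first run of two or more spaces
full_keywords <- regmatches(picked, regexpr("^\\S+(?: \\S+)*", picked, perl = TRUE))
print(full_keywords)
```

A keyword that is separated from the abstract text by only one space would be merged with it, so the result is worth checking by eye.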

regex

r

stringr