How to extract keywords below and above a text from an article
I have this character vector of lines from a journal article:
test_1 <- c(" Journal of Neonatal Nursing 27 (2021) 106–110",
" Contents lists available at ScienceDirect",
" Journal of Neonatal Nursing",
" journal homepage: www.elsevier.com/locate/jnn",
"Comparison of inter-facility transports of critically ill neonates who died",
"after admission vs. survivors", "Robert Schultz a, *, Jennifer Berk-King a, Laura Wallace a, Girija Natarajan a, b",
"a", " Children’s Hospital of Michigan, Detroit, MI, USA",
"b", " Division of Neonatology, Wayne State University School of Medicine, Detroit, MI, USA",
"A R T I C L E I N F O A B S T R A C T",
"Keywords: Objective: To compare characteristics before, during and after inter-facility transports (IFT), and changes in the",
"Inter-facility transport Transport Risk Index of Physiologic Stability (TRIPS) before and after inter-facility transports (IFT) in infants",
"Neonatal intensive care who died within 7 days of admission to a level IV NICU versus matched survivors.",
"Mortality", " Study design: This retrospective case-control study included infants who died within 7 days of IFT and controls",
" matched for gestational age and reason for admission. Unplanned events were temperature or respiratory de",
" rangements. Therapeutic interventions included increased respiratory support, resuscitation or blood product",
" transfusion.",
" Results: Our cohort was predominantly preterm and male. Cases had a higher rate of resuscitation, lower Apgar",
" scores, more respiratory acidosis, lower BP and higher TRIPS, compared to controls. Deterioration in TRIPS was",
" independently associated with male gender and unplanned events; not with patient group.",
" Conclusions: Rates of unplanned events, therapeutic interventions, and deterioration in TRIPS following IFT by a",
" transport team are comparable in cases and controls.",
" outcomes. The Transport Risk Index of Physiologic Stability (TRIPS) is",
"1. Introduction an assessment measure of infant status before and after transport (Lee"
)
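I want to extract the keywords from these lines, which are Inter-facility transport, Neonatal intensive care, and Mortality. I tried to get the line containing "Keywords" with test_1[str_detect(test_1, "^Keywords:")]. I want to get all the keywords below this line and above 1. Introduction.
Which regex or stringr functions would do this?
Thanks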
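If I understand correctly, you are scanning the PDF downloaded from here. I think you should find a better way to scan your PDF.
Until then, the best option is probably: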
library(stringr)
# get the line after ^Keywords:
start <- which(str_detect(test_1, "^Keywords:")) +1
# get the line before "1. Introduction" (the dot is escaped so it matches a literal ".")
end <- which(str_detect(test_1, "^1\\. Introduction")) - 1
# get the lines in between
x <- test_1[start:end]
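# Extract keywords: keep the first 60 characters of each line (assumed to be
# the keyword column of the fixed-width extraction), then trim whitespace
x <- str_trim(str_sub(x, 1, 60))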
x <- x[x!=""]
x
#> [1] "Inter-facility transport" "Neonatal intensive care" "Mortality"
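If the extraction step keeps the two-column spacing of the PDF, a fixed cut-off such as 60 characters can be avoided by cutting each line at the first run of two or more spaces. The following is only a sketch under that assumption: the lines vector is made-up sample data with hypothetical spacing, not the real output of any PDF tool, and it uses base R so no packages are needed.

```r
# Hypothetical lines that keep the two-column layout of the PDF:
# keyword column on the left, abstract text on the right.
lines <- c(
  "Keywords:                    Objective: To compare characteristics",
  "Inter-facility transport     Transport Risk Index of Physiologic Stability",
  "Neonatal intensive care      who died within 7 days of admission",
  "Mortality",
  "                             Study design: This retrospective study",
  "1. Introduction              an assessment measure of infant status"
)

# Lines strictly between "Keywords:" and "1. Introduction"
start <- which(grepl("^Keywords:", lines)) + 1
end   <- which(grepl("^1\\. Introduction", lines)) - 1
block <- lines[start:end]

# Keep only lines that start in the left column (no leading whitespace),
# then drop everything from the first gap of two or more spaces onwards
left     <- block[grepl("^\\S", block)]
keywords <- sub("\\s{2,}.*$", "", left)
keywords
#> [1] "Inter-facility transport" "Neonatal intensive care"  "Mortality"
```

This only works when the gap between the columns is wider than the spaces inside a keyword; if the whitespace has been collapsed to single spaces, the two columns cannot be separated this way.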
Edit:
You can define a function that finds the line where Keywords occurs and returns the indices of the lines below it:
find_keywords <- function(pattern, text) {
  # index of the line(s) matching the pattern
  index <- which(grepl(pattern, text))
  # indices of the three lines below; if you suspect there are more than
  # three keywords, then just add `index + ...`
  sort(c(index + 1, index + 2, index + 3))
}
Based on this function, the keywords can be extracted:
library(stringr)
# "^\\S+" keeps only the first word of each line
str_extract(test_1[find_keywords(pattern = "^Keywords:", text = test_1)], "^\\S+")
#> [1] "Inter-facility" "Neonatal" "Mortality"
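Instead of adding more index + ... terms by hand, the number of lines to take could be made an argument. This is a hypothetical variant, not part of the answer above: the name find_keywords_n, the parameter n, and the short lines vector are all made up for illustration.

```r
# Hypothetical generalisation: `n` is the number of lines to take after each match
find_keywords_n <- function(pattern, text, n = 3) {
  index <- which(grepl(pattern, text))
  # every offset 1..n added to every matching line index
  after <- sort(unique(as.vector(outer(index, seq_len(n), "+"))))
  # drop indices that would run past the end of the vector
  after[after <= length(text)]
}

# Made-up sample lines, shortened from the question
lines <- c("Title", "Keywords: Objective: ...", "Inter-facility transport ...",
           "Neonatal intensive care ...", "Mortality", " Study design: ...")

find_keywords_n("^Keywords:", lines, n = 3)
#> [1] 3 4 5
```

You still need to know, or guess, how many keyword lines follow, so this has the same limitation as the fixed three-line version.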