Extract only the paragraphs containing a keyword in R

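I have text files loaded in R, and I need to scan many documents for paragraphs that mention "discount rate". I then want to extract the whole paragraph in which the phrase occurs, and only that paragraph. In the text files, each paragraph is preceded and followed by an empty line, which shows up as "". Below is some example code I tried that does not work, followed by a few paragraphs of the txt file, one of which contains the keyword "discount rate".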

 t <- c(grep(" discount rate ",txt,ignore.case = T),grep(" discounted cash flow",txt,ignore.case = T))

  temp <- unlist(str_extract_all(txt,"\r\r. discount rate .\r\r"))

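So my approach was to try to extract all the lines between one "" and the next "", provided they contain "discount rate", but this code is clearly unsuccessful.


""
" (9) any sale or disposition of any property or equipment that is"
"damaged, worn out, obsolete or otherwise no longer useful, or that is"
"no longer used in connection with the business of Armor"
"Holdings or its Restricted Subsidiaries."
""
" \"Attributable Debt\" in respect of a sale and leaseback transaction"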
"means, at the time of determination, the present value of the obligation of the"
"lessee for net rental payments during the remaining term of the lease included"
"in such sale and leaseback transaction, including any period for which such"
"lease has been extended or may, at the option of the lessor, be extended. Such"
"present value shall be calculated using a discount rate equal to the rate of"
"interest implicit in such transaction, determined in accordance with GAAP."
""
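" \"Beneficial Owner\" has the meaning assigned to such term in Rule 13d-3"
"and Rule 13d-5 under the Exchange Act, except that in calculating the beneficial"
"ownership of any particular \"person\" (as that term is used in Section 13(d)(3)"
"of the Exchange Act), such \"person\" shall be deemed to have beneficial ownership"
"of all securities that such \"person\" has the right to acquire by conversion or"
"exercise of other securities, whether such right is currently exercisable or is"
"exercisable only upon the occurrence of a subsequent condition. The terms"
"\"Beneficially Owns\" and \"Beneficially Owned\" shall have a corresponding meaning."
""
" \"Board of Directors\" means:"
""
" (1) with respect to a corporation, the board of directors of the"
" corporation;"
""
" (2) with respect to a partnership, the board of directors of the"
" general partner of the partnership; and"
""
" (3) with respect to any other Person, the board or committee of such"
" Person serving a similar function."
""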

With your file saved as text.txt, this works for me:

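data <- readLines("text.txt")
# mark each empty line (a paragraph boundary) with a newline character
data[nchar(data) == 0] <- "\n"
# join the lines with spaces so words at line ends do not fuse, then split on the markers
data <- strsplit(paste(data, collapse = " "), "\n")[[1]]
# keep only the paragraphs that mention the keyword
data[grepl("discount rate", data, ignore.case = TRUE)]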

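I put \n in place of every empty line as a placeholder, so that I can split on it in the strsplit call. For your sample this returns only the second paragraph. Hope this helps!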
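For a quick check without a file on disk, the same idea can be sketched on an in-memory character vector. The sample lines and the names `lines`, `paragraphs` and `hits` are made up for illustration, and gluing wrapped lines together with a single space is an assumption about how you want a paragraph rebuilt.

```r
# hypothetical sample: three paragraphs separated by empty strings
lines <- c("", "first paragraph, line one", "line two", "",
           "present value shall be calculated using a discount rate equal to",
           "the rate of interest", "",
           "third paragraph")

# mark paragraph boundaries, glue everything together, split on the markers
lines[nchar(lines) == 0] <- "\n"
paragraphs <- strsplit(paste(lines, collapse = " "), "\n")[[1]]

# tidy up: strip the padding spaces and drop empty pieces
paragraphs <- trimws(paragraphs)
paragraphs <- paragraphs[nzchar(paragraphs)]

# keep only the paragraphs that mention the keyword
hits <- paragraphs[grepl("discount rate", paragraphs, ignore.case = TRUE)]
print(hits)
```

Each element of `paragraphs` is one whole paragraph, so the final `grepl` filter selects entire paragraphs rather than single lines.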

If you don't want to change the line breaks in the text, you can do the following (txt, as in your question, is a character vector):

# generate a variable for paragraph number
df <- data.frame(txt, paragraph = cumsum(txt == "")) 
# find paragraphs with the search term (case-insensitive, as in the question)
keep_paragraph <- df[grep("discount rate", df[, "txt"], ignore.case = TRUE), "paragraph"]
# subset the data.frame
df <- df[df$paragraph %in% keep_paragraph,]
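If you would rather get each matching paragraph back as a single string, and search for more than one phrase, the paragraph-number idea can be wrapped in a small function. This is only a sketch: `extract_paragraphs` is a hypothetical helper name, the sample `txt` is invented, and joining lines with a space is an assumption. A side effect of joining first is that a phrase wrapped across two lines (as "discount" / "rate" is below) is still found.

```r
# hypothetical helper: return every paragraph of `txt` that matches `pattern`
extract_paragraphs <- function(txt, pattern) {
  # lines that follow the same empty string share a paragraph number
  paragraph <- cumsum(txt == "")
  # glue the non-empty lines of each paragraph into one string
  keep <- txt != ""
  paras <- vapply(split(txt[keep], paragraph[keep]), paste,
                  character(1), collapse = " ")
  # keep the paragraphs that match, ignoring case
  unname(paras[grepl(pattern, paras, ignore.case = TRUE)])
}

# invented sample: empty strings separate the paragraphs
txt <- c("", "first paragraph", "",
         "calculated using a discount", "rate equal to the rate", "",
         "a discounted cash flow model", "")

res <- extract_paragraphs(txt, "discount rate|discounted cash flow")
print(res)
```

To scan many documents, the same function could be applied to each file in turn, for example with `lapply(files, function(f) extract_paragraphs(readLines(f), "discount rate"))`, where `files` is a character vector of paths that you supply.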