Find a pattern across multiple lines with R

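I am trying to identify a pattern that spans multiple lines, two lines to be exact. I am taking this approach because the pattern on either line is not unique by itself.

So far I have tried the function "grep", but I think I am missing the correct regular expression here.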

grep("^Item\\s{0,}2[^A]", f.text, ignore.case = TRUE)

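This is part of a modified version of the function "getFilings" from the edgar package, which tries to extract only Management's Discussion / Item 2 of the quarterly reports. If possible, I would like to include something after the ... 2[^A] in the pattern that accounts for the new line, followed by the string "Management...".

The pattern in my plain text looks like this:

Item 2.
Management's Discussion and Analysis of Financial Condition and Results of Operations

I would appreciate any advice on how best to capture this with a regular expression in R.

The sample input looks like this:

21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk

The desired output is

Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10

I need to match "Item 2. ... Management Discussion" because Item 2 alone is not unique. How can I formulate a regular expression that spans two lines?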

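This is not very sophisticated, since I am no expert in string manipulation, but the tidyverse packages provide some powerful tools to get the output you want.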

text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk Item 4.
Fluffy Text example Item 5.
Lorem ipsum dolor sit amet, consectetur adipisici elit"

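Now

library(tidyverse)

text %>%
  # capture the rest of the line that follows "Item <digit>." and a line break
  str_extract_all("(?<=Item\\s\\d[[:punct:]]\\n).*", simplify = TRUE) %>%
  # drop the heading of the next item that trails on the same line
  str_remove("\\s+Item\\s\\d[[:punct:]]")

gives you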

[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
[2] "Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"                           
[3] "Fluffy Text example"                                                                                                                                 
[4] "Lorem ipsum dolor sit amet, consectetur adipisici elit" 

If you only want to extract Item 2, replace \\d with 2 in the str_extract_all pattern.
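As a minimal sketch of that substitution, the pattern below looks only for Item 2 and additionally requires the next line to begin with "Management", which is the two-line condition from the question. The variable name item2 and the trailing $ anchor in the clean-up step are assumptions added for illustration, not part of the answer above.

```r
library(stringr)

text <- "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

# Look behind for "Item 2." plus a line break, then require "Management"
# at the start of the following line; "." stops at the next line break.
item2 <- str_extract(text, "(?<=Item\\s2[[:punct:]]\\n)Management.*")

# Drop the heading of the next item that trails at the end, e.g. " Item 3."
item2 <- str_remove(item2, "\\s+Item\\s\\d[[:punct:]]$")

item2
```

Because the lookbehind is not part of the match, the "Item 2." heading itself never appears in the result.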

You can simply remove the line breaks:

gsub("\n", "", text)
[1] "21 Item 2.Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

Now you have everything in one long line and can extract any pattern you want. For example, using str_extract from the package stringr:
library(stringr)
str_extract(gsub("\n", "", text), "Management.*on Form 10")
[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
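The pattern above hardcodes the ending "on Form 10", and removing the line breaks glues "2." directly onto "Management". A possible alternative, sketched here in base R under the assumption that every section ends where the next "Item <number>." heading begins, keeps the line break as part of the anchor and stops lazily at the next heading. The names pattern, m and section are illustrative only.

```r
text <- "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

# (?<=Item 2\.\n)  : the match must be preceded by "Item 2." and a line break
# Management.*?    : lazily consume the line, starting at "Management"
# (?=...|$)        : stop before the next "Item <n>." heading or at the end
pattern <- "(?<=Item 2\\.\\n)Management.*?(?=\\s+Item\\s+\\d+\\.|$)"

# perl = TRUE is required because lookarounds are a PCRE feature
m <- regexpr(pattern, text, perl = TRUE)
section <- regmatches(text, m)

section
```

If no match is found, regmatches returns an empty character vector, so the result can be checked with length(section) before it is used.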

Data:

text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

text
[1] "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"
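The grep call in the question suggests that f.text may instead be a character vector with one element per line, for example as returned by readLines(). That is an assumption, and the sample lines below are made up for illustration. In that case no multi-line regular expression is needed: a rough sketch is to find the lines that end in an "Item 2" heading and keep only those whose following line starts with "Management".

```r
# Hypothetical input: one element per line; the second "Item 2." stands in
# for a non-unique heading elsewhere in the filing.
f.text <- c(
  "21 Item 2.",
  "Management Discussion and Analysis of Financial Condition and Results of Operations",
  "Item 3.",
  "Quantitative and Qualitative Disclosures About Market Risk",
  "Item 2.",
  "Unregistered Sales of Equity Securities"
)

# Lines that end in an "Item 2" heading (the trailing dot is optional)
heading <- grep("Item\\s*2\\.?\\s*$", f.text, ignore.case = TRUE)

# A heading on the very last line has no following line to inspect
heading <- heading[heading < length(f.text)]

# Keep only the headings whose next line starts with "Management"
hits <- heading[grepl("^\\s*Management", f.text[heading + 1], ignore.case = TRUE)]

f.text[hits + 1]
```

Here hits holds the index of the matching heading line, so hits + 1 points at the first line of the section and can serve as the starting position for further extraction.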