In R, how do I match a markdown list?

I am trying to match the following ordered and unordered lists and extract the bullet/list items.

library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

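What I'd like to do is:

  1. Programmatically identify that the text is a list
  2. Parse each list into the text of its individual items

The result would be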

some_str_fun(example,pattern) # or multiples
"Bullet 1" "Bullet 2" "Bullet 3"
"Bullet 1" "Bullet 2" "Bullet 3"
"This is a test 1" "This is a test with some *formatting*" 
"This is a test with different _formatting_"

I have been working with the following patterns in str_extract/str_match, but I can't seem to find anything that fully works

[*]+\s(.*?)[\n]* # for * Bullet X\n
[1-9]+[.]\s(.*?)[\n]* # for 1. Bullet X\n

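I have gone through a series of different iterations of these patterns, but I can't quite get what I'm looking for.

This is a somewhat different approach, but if you render the markdown to HTML, you can use existing extraction methods to do what you want: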

library(stringr)

examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

extract_md_list <- function(md_text) {

  require(rvest)
  require(rmarkdown)

  fil_md <- tempfile()
  fil_html <- tempfile()
  writeLines(md_text, con=fil_md)

  render(fil_md, output_format="html_document", output_file=fil_html, quiet=TRUE)

  pg <- read_html(fil_html)  # read_html() replaces the deprecated html() in rvest
  ret <- html_nodes(pg, "li") %>% html_text()

  # cleanup
  unlink(fil_md)
  unlink(fil_html)

  return(ret)

}

extract_md_list(examples)

## [1] "Bullet 1"                                
## [2] "Bullet 2"                                
## [3] "Bullet 3"                                
## [4] "Bullet 1"                                
## [5] "Bullet 2"                                
## [6] "Bullet 3"                                
## [7] "This is a test 1"                        
## [8] "This is a test with some formatting"     
## [9] "This is a test with different formatting"

Here is another option. You can wrap it in unlist if needed:

str_extract_all(examples, "[^*1-9\n ]\\w+( ?[\\w*]+)*")
# or 
#str_extract_all(examples, "[^*1-9\n ]\\w+( ?[a-zA-Z0-9_*]+)*")

#[[1]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[2]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[3]]
#[1] "This is a test 1"                          
#[2] "This is a test with some *formatting*"     
#[3] "This is a test with different _formatting_"

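There are several other options, especially if you don't care about getting everything in a single regex or a single line of code. Here is another approach. The regex is simpler, but you end up with empty strings (""), which take an extra line to remove:

splits <- unlist(str_split(examples, "\n|\\d+\\. |\\* "))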
splits[splits != ""]
#[1] "Bullet 1"                                  
#[2] "Bullet 2"                                  
#[3] "Bullet 3"                                  
#[4] "Bullet 1"                                  
#[5] "Bullet 2"                                  
#[6] "Bullet 3"                                  
#[7] "This is a test 1"                          
#[8] "This is a test with some *formatting*"     
#[9] "This is a test with different _formatting_"
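
As a variation on the split idea, here is a minimal base R sketch that splits on newlines first and then strips the leading marker from each line, so no empty strings are produced. It assumes every line is a list item that starts with `*` or a number followed by a period; the names `md_lines` and `items` are only illustrative.

```r
examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# One element per line of markdown
md_lines <- unlist(strsplit(examples, "\n", fixed = TRUE))

# Remove the leading "* " or "N. " marker; sub() only touches the first match,
# so inline *emphasis* later in the line is left alone
items <- sub("^\\s*(?:[*]|[0-9]+[.])\\s+", "", md_lines, perl = TRUE)
items
```

Because the pattern is anchored with `^`, an asterisk that appears later in a line is never treated as a bullet.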

You can use strapply from the gsubfn package to match the whole pattern.

library(gsubfn)

examples <- c(
    "* Bullet 1\n* Bullet 2\n* Bullet 3",
    "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
    "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

strapply(examples, '(?:\\*|\\d+\\.) *([^\n]+)', c, simplify = c)

# [1] "Bullet 1"                                  
# [2] "Bullet 2"                                  
# [3] "Bullet 3"                                  
# [4] "Bullet 1"                                  
# [5] "Bullet 2"                                  
# [6] "Bullet 3"                                  
# [7] "This is a test 1"                          
# [8] "This is a test with some *formatting*"     
# [9] "This is a test with different _formatting_"
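
None of the approaches above covers the first part of the question, identifying whether a string is a list at all. The following is a rough sketch of one way to do that with stringr, under the assumption that "is a list" means every non-empty line starts with a bullet (`*`, `-`, `+`) or a number followed by a period. The helper names `is_md_list` and `md_list_items` are hypothetical, and this is not a full markdown parser (nested lists and multi-line items are not handled).

```r
library(stringr)

examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# Optional indent, a marker (*, -, + or "N."), whitespace, then the item text
item_pattern <- "^\\s*(?:\\*|-|\\+|\\d+\\.)\\s+(.*)$"

# Hypothetical helper: TRUE when every non-empty line looks like a list item
is_md_list <- function(x) {
  md_lines <- str_split(x, "\n")[[1]]
  md_lines <- md_lines[md_lines != ""]
  length(md_lines) > 0 && all(str_detect(md_lines, item_pattern))
}

# Hypothetical helper: the captured text of each line that is a list item
md_list_items <- function(x) {
  md_lines <- str_split(x, "\n")[[1]]
  m <- str_match(md_lines, item_pattern)
  m[!is.na(m[, 2]), 2]
}

vapply(examples, is_md_list, logical(1), USE.NAMES = FALSE)
lapply(examples, md_list_items)
```

Working line by line keeps the pattern anchored at the start of each line, which is what stops the emphasis asterisks in the third example from being mistaken for bullets.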