In R, how do I match Markdown lists?
I am trying to match the following ordered and unordered lists and extract the bullet/list items.
library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
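What I want to do is:
- Programmatically identify that the string is a list
- Parse each string into the text of its list items
The result would be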
some_str_fun(example,pattern) # or multiples
"Bullet 1" "Bullet 2" "Bullet 3"
"Bullet 1" "Bullet 2" "Bullet 3"
"This is a test 1" "This is a test with some *formatting*"
"This is a test with different _formatting_"
I have been experimenting with the following patterns and str_extract/match, but I can't find anything that works completely:
[*]+\s(.*?)[\n]* # for * Bullet X\n
[1-9]+[.]\s(.*?)[\n]* # for 1. Bullet X\n
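I have tried a number of different iterations of these patterns, but none of them gives me quite what I am looking for.
This is a somewhat different approach, but if you render the Markdown to HTML, you can use existing extraction tools to do what you want: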
library(stringr)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
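extract_md_list <- function(md_text) {
  # rvest supplies read_html(), html_nodes(), html_text() and %>%; rmarkdown supplies render()
  require(rvest)
  require(rmarkdown)
  # explicit extensions so render() reads and writes exactly these files
  fil_md <- tempfile(fileext = ".md")
  fil_html <- tempfile(fileext = ".html")
  writeLines(md_text, con = fil_md)
  render(fil_md, output_format = "html_document", output_file = fil_html, quiet = TRUE)
  # html() is deprecated in rvest; read_html() replaces it
  pg <- read_html(fil_html)
  ret <- html_nodes(pg, "li") %>% html_text()
  # cleanup
  unlink(fil_md)
  unlink(fil_html)
  return(ret)
}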
extract_md_list(examples)
## [1] "Bullet 1"
## [2] "Bullet 2"
## [3] "Bullet 3"
## [4] "Bullet 1"
## [5] "Bullet 2"
## [6] "Bullet 3"
## [7] "This is a test 1"
## [8] "This is a test with some formatting"
## [9] "This is a test with different formatting"
Here is another option. You can wrap it in unlist if needed:
str_extract_all(examples, "[^*1-9\n ]\\w+( ?[\\w*]+)*")
# or
#str_extract_all(examples, "[^*1-9\n ]\\w+( ?[a-zA-Z0-9_*]+)*")
#[[1]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[2]]
#[1] "Bullet 1" "Bullet 2" "Bullet 3"
#
#[[3]]
#[1] "This is a test 1"
#[2] "This is a test with some *formatting*"
#[3] "This is a test with different _formatting_"
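One caveat with the character-class pattern above: the item text cannot start with a digit from 1 to 9, and punctuation such as a comma ends the match early. A possible alternative is to anchor on the list marker at the start of each line and capture the rest of the line. The following is only a sketch, assuming the markers are `*`, `+`, `-`, or digits followed by `.`; the helper name `extract_items` is made up for illustration.

```r
library(stringr)

examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*"
)

# Hypothetical helper: capture the text that follows a leading list marker.
# (?m) makes ^ and $ match at the start and end of every line.
extract_items <- function(x) {
  pattern <- "(?m)^[ \\t]*(?:[*+-]|\\d+\\.)[ \\t]+(.*)$"
  # str_match_all() returns one matrix per input; column 2 is the capture group
  lapply(str_match_all(x, pattern), function(m) m[, 2])
}

items <- extract_items(examples)
items
```

Because the marker must be followed by a space or tab, inline emphasis such as `*formatting*` in the middle of an item is left alone.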
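There are several other options, especially if you don't care about doing everything in a single regex or a single line of code. Here is another approach. The regex is simpler, but you end up with "" entries, which take an extra line to remove:
splits <- unlist(str_split(examples, "\n|\\d+\\. |\\* "))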
splits[splits != ""]
#[1] "Bullet 1"
#[2] "Bullet 2"
#[3] "Bullet 3"
#[4] "Bullet 1"
#[5] "Bullet 2"
#[6] "Bullet 3"
#[7] "This is a test 1"
#[8] "This is a test with some *formatting*"
#[9] "This is a test with different _formatting_"
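The first goal in the question, identifying programmatically that a string is a list, can be handled by checking each line for a marker. This is a minimal base R sketch, not a full Markdown parser: it assumes a list is a string in which every non-blank line starts with `*`, `+`, `-`, or digits followed by `.`, then whitespace. The name `is_md_list` is hypothetical.

```r
# Hypothetical helper: TRUE when every non-blank line begins with a list marker.
is_md_list <- function(x) {
  marker <- "^[ \t]*([*+-]|[0-9]+\\.)[ \t]+"
  vapply(strsplit(x, "\n", fixed = TRUE), function(lines) {
    lines <- lines[nzchar(trimws(lines))]  # ignore blank lines
    length(lines) > 0 && all(grepl(marker, lines))
  }, logical(1))
}

res <- is_md_list(c(
  "* Bullet 1\n* Bullet 2",
  "1. Bullet 1\n2. Bullet 2",
  "Just a sentence with *formatting*."
))
res
```

The result can be used to filter the input before running any of the extraction patterns.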
You can use strapply from the gsubfn package to match the whole pattern and return only the captured part.
library(gsubfn)
examples <- c(
"* Bullet 1\n* Bullet 2\n* Bullet 3",
"1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
"* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)
strapply(examples, '(?:\\*|\\d+\\.) *([^\n]+)', c, simplify = c)
# [1] "Bullet 1"
# [2] "Bullet 2"
# [3] "Bullet 3"
# [4] "Bullet 1"
# [5] "Bullet 2"
# [6] "Bullet 3"
# [7] "This is a test 1"
# [8] "This is a test with some *formatting*"
# [9] "This is a test with different _formatting_"
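If adding a package is not an option, roughly the same extraction can be done in base R. This is a sketch under the same marker assumptions as above (`*`, `+`, `-`, or digits followed by `.`); it relies on the PCRE `\K` escape, which discards the marker from the reported match.

```r
examples <- c(
  "* Bullet 1\n* Bullet 2\n* Bullet 3",
  "1. Bullet 1\n2. Bullet 2\n3. Bullet 3",
  "* This is a test 1\n* This is a test with some *formatting*\n* This is a test with different _formatting_"
)

# (?m) makes ^ and $ work per line; \K drops the marker matched so far
pattern <- "(?m)^[ \\t]*(?:[*+-]|\\d+\\.)[ \\t]+\\K.*$"
items <- unlist(regmatches(examples, gregexpr(pattern, examples, perl = TRUE)))
items
```

Dropping the `unlist()` call keeps one character vector per input string, which may be more convenient when the lists need to stay separate.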