Automatically extracting sections (and section titles) from a file
I need to extract all subsections (for further text analysis) and their titles from an .Rmd file, e.g. 01-tidy-text.Rmd from the tidy-text-mining book:
https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/01-tidy-text.Rmd
As far as I understand, a section starts at a ## marker and runs until the next # or ## marker, or until the end of the file.
The entire text has already been read in (using dt <- readtext("01-tidy-text.Rmd"); strEntireText <- dt[1,1]) and is stored in the variable strEntireText.
I would like to use stringr or stringi for this, roughly along these lines:
strAllSections <- str_extract(strEntireText , pattern="...")
strAllSectionsTitles <- str_extract(strEntireText , pattern="...")
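Please suggest a solution. Thanks.
The ultimate objective of this exercise is to automatically create a data.frame from an .Rmd file, where each row corresponds to one section (or subsection) and the columns contain: the section title, the section label, the section text itself, and some other section-specific details that will be extracted later.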
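Here is an example using a tidyverse approach. It will not necessarily work for every file you have. If you are working with markdown, you should probably try to find a proper markdown parsing library, as Spacedman mentioned in his comment.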
library(tidyverse)

## A data frame where each row is one line of the .Rmd file.
## tibble() replaces the deprecated data_frame().
raw <- tibble(
  text = read_lines("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/01-tidy-text.Rmd")
)
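
## We don't want to mark R comments inside code chunks as sections.
## The running count of ``` fences is odd while we are inside a chunk,
## so the function returns TRUE for lines that belong to a code block.
detect_codeblocks <- function(text) {
  blocks <- text %>%
    str_detect("```") %>%
    cumsum()
  blocks %% 2 != 0
}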

## Here is an example of how you can extract information, such as
## headers, using regex patterns. str_extract() returns a plain
## character vector (str_match() would return a matrix).
## Note: fill() carries the last subsection forward, also across a
## new top-level section.
df <-
  raw %>%
  mutate(
    code_block = detect_codeblocks(text),
    section = text %>%
      str_extract("^# .*") %>%
      str_remove("^#+ +"),
    section = ifelse(code_block, NA, section),
    subsection = text %>%
      str_extract("^## .*") %>%
      str_remove("^#+ +"),
    subsection = ifelse(code_block, NA, subsection)
  ) %>%
  fill(section, subsection)

## If you wish to glue the text together within sections/subsections,
## then just group by them and flatten the text.
df %>%
  group_by(section, subsection) %>%
  slice(-1) %>% # remove the header line of each group
  summarize(
    text = text %>%
      str_flatten(" ") %>%
      str_trim()
  ) %>%
  ungroup()
#> # A tibble: 7 x 3
#> section subsection text
#> <chr> <chr> <chr>
#> 1 The tidy text format {#tidytext} Contrastin… "As we stated above, we de…
#> 2 The tidy text format {#tidytext} Summary In this chapter, we explor…
#> 3 The tidy text format {#tidytext} The `unnes… "Emily Dickinson wrote som…
#> 4 The tidy text format {#tidytext} The gutenb… "Now that we've used the j…
#> 5 The tidy text format {#tidytext} Tidying th… "Let's use the text of Jan…
#> 6 The tidy text format {#tidytext} Word frequ… "A common task in text min…
#> 7 The tidy text format {#tidytext} <NA> "```{r echo = FALSE} libra…
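
The question starts from a single string (strEntireText) and also asks for the section label, so here is a minimal sketch of the same idea for that case, using only stringr and base R. The sample document rmd_text, the helper variable names and the output column names are made up for illustration. It assumes headers of level 1 or 2 written with leading # characters, code fences that start at the beginning of a line, and labels written as {#label} at the end of the header. Note that str_extract() only returns the first match per string, which is why the text is split into lines first.

```r
library(stringr)

# Hypothetical stand-in for strEntireText: a tiny .Rmd document in one string.
fence <- strrep("`", 3)
rmd_text <- paste(
  "# Chapter one {#ch1}",
  "Intro text.",
  "## First part {#part-a}",
  "Some prose.",
  paste0(fence, "{r}"),
  "## an R comment, not a header",
  "x <- 1",
  fence,
  "## Second part",
  "More prose.",
  sep = "\n"
)

# One element per line of the document.
lines <- str_split(rmd_text, "\n")[[1]]

# TRUE while an odd number of fences has been seen, i.e. inside a code chunk.
in_chunk <- cumsum(str_detect(lines, "^```")) %% 2 == 1

# Headers: one or two '#' followed by a space, outside code chunks.
is_header <- str_detect(lines, "^#{1,2} ") & !in_chunk
header_lines <- lines[is_header]

# Each line belongs to the most recent header above it (0 = before any header).
section_id <- cumsum(is_header)

sections <- data.frame(
  level = str_length(str_extract(header_lines, "^#+")),
  title = str_trim(str_remove(str_remove(header_lines, "^#+ +"),
                              "\\{#[^}]*\\}\\s*$")),
  label = str_match(header_lines, "\\{#([^}]+)\\}\\s*$")[, 2],
  text = vapply(seq_along(header_lines), function(i) {
    # Collapse all non-header lines of section i into one string.
    str_trim(str_flatten(lines[section_id == i & !is_header], " "))
  }, character(1)),
  stringsAsFactors = FALSE
)

print(sections[, c("level", "title", "label")])
```

The label column is NA for headers without a {#label}, and the text column still contains the code chunks. To drop them, filter on !in_chunk (and on the closing fence lines) before flattening.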