How do I scrape only one section of text from a webpage in R?

Question

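I am trying to scrape a specific section of HTML-based journal articles. For example, if I only wanted the "Statistical Analyses" section of articles in a Frontiers publication, how could I do that? Because the number of paragraphs and the position of the section vary from article to article, SelectorGadget doesn't help.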

https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full

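I have tried using rvest with html_nodes and an XPath expression, but without any luck. The best I can do is start scraping at the section I want, but I can't get it to stop afterwards. Any suggestions?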

library(rvest)

example_page <- "https://www.frontiersin.org/articles/10.3389/fnagi.2010.00032/full"
# Selects every <p> sibling after the heading, so it runs past the end of the section
example_stats_section <- read_html(example_page) %>%
  html_nodes(xpath = "//h3[contains(., 'Statistical Analyses')]/following-sibling::p") %>%
  html_text()

Answer 1

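Since every "Statistical Analyses" section appears to be followed by a "Results" section, the XPath

//h3[.='Statistical Analyses']/following-sibling::p[following::h2[.='Results']]

gets the desired section: it keeps only those paragraphs after the "Statistical Analyses" heading that still have the "Results" heading somewhere after them.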
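To see how this expression behaves, here is a minimal sketch that runs it with rvest against a small inline HTML fragment. The fragment and the names toy_page, stats_xpath and stats_section are made up for illustration; the real Frontiers markup may nest its headings and paragraphs differently, so treat this as a check of the XPath logic rather than a guaranteed recipe for the live page.

```r
library(rvest)

# Hypothetical, simplified stand-in for an article page (an assumption,
# not the actual Frontiers markup).
toy_page <- read_html('
<div>
  <h2>Materials and Methods</h2>
  <h3>Participants</h3>
  <p>Participants paragraph.</p>
  <h3>Statistical Analyses</h3>
  <p>First stats paragraph.</p>
  <p>Second stats paragraph.</p>
  <h2>Results</h2>
  <p>Results paragraph.</p>
</div>')

# Paragraph siblings after the "Statistical Analyses" heading, restricted to
# those that still have a "Results" heading later in the document.
stats_xpath <- paste0(
  "//h3[.='Statistical Analyses']",
  "/following-sibling::p[following::h2[.='Results']]"
)

stats_section <- toy_page %>%
  html_nodes(xpath = stats_xpath) %>%
  html_text()

print(stats_section)
```

For the real article, swapping toy_page for read_html(example_page) should be the only change needed, provided the headings on that page match the text in the XPath exactly.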

xpath

r

web-scraping

rvest