Parse HTML into text at the div level in R
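library(XML)
library(xml2)  # read_html() comes from xml2 (also re-exported by rvest), not XML
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
# htmlTreeParse() expects text or a file name, so serialize the parsed page first
doc.html <- htmlTreeParse(as.character(html), asText = TRUE, useInternalNodes = TRUE)
# '//div' matches every <div>, including those nested inside other <div>s
doc.text <- unlist(xpathApply(doc.html, '//div', xmlValue))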
The code above reads the text twice because of the div levels/structure; I only need the text once. Thank you for your time and help. That is:
doc.text[2] # contains all the text which repeats again in 3 to 59
Try this:
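library(rvest)
library(tidyverse)
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
# Select only the <div> elements that are direct children of <text>,
# so nested <div>s are not matched a second time
text <- html %>%
  html_nodes(xpath = "//text/div") %>%
  html_text(trim = TRUE) %>%
  paste(collapse = " ")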
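To see why `//div` repeats text while a parent-anchored path does not, here is a minimal offline sketch. The `snippet` string is a made-up stand-in for the filing, not its real markup, and `//body/div` plays the role that `//text/div` plays in the answer above; the variable names are illustrative only.

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical markup: one outer <div> wrapping two inner <div>s, plus a sibling
snippet <- "<html><body><div><div>Item 1</div><div>Item 2</div></div><div>Item 3</div></body></html>"
doc <- read_html(snippet)

# '//div' matches all four <div>s; the outer one already contains
# the text of its two children, so that text shows up twice
all_divs <- doc %>%
  html_nodes(xpath = "//div") %>%
  html_text(trim = TRUE)

# Anchoring on the parent keeps only the two top-level <div>s
top_divs <- doc %>%
  html_nodes(xpath = "//body/div") %>%
  html_text(trim = TRUE)

combined <- paste(top_divs, collapse = " ")
print(all_divs)
print(combined)
```

If the top-level container is not known in advance, an XPath such as `//div[not(ancestor::div)]` should select the outermost `<div>`s in the same way, though that has not been checked against the filing itself.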