How to extract one row from XML using R?

Question

I am reading data from a link like this:

> library(XML)
> url <- "http://biostat.jhsph.edu/~jleek/contact.html"
> html <- htmlTreeParse(url, useInternalNodes=T)

Then I want to extract the tenth line from it and count its characters. How can I do that?

Answer 1

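Is this what you are looking for? Find the tenth line of the HTML (the div with id = main), extract its value, and count the number of characters in the extracted content.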

> url <- "http://biostat.jhsph.edu/~jleek/contact.html"
> html <- htmlTreeParse(url, useInternalNodes=T)
> xpathSApply(html, "//div[@id = 'main']", xmlValue, trim = TRUE)
[1] "Contact Information\n\n\t\t\t  Address \n\t\t\t  \n\t\t\t  Johns Hopkins University \n\t\t\t  Bloomberg School of Public Health \n\t\t\t  615 North Wolfe Street \n\t\t\t  Baltimore, MD 21205-2179 \n\t\t\t  Phone\n\t\t\t  410-955-1166 (I am much easier to reach by email)\n\t\t\t  Fax\n\t\t\t  410-955-0958\n\t\t\t  Email\n\t\t\t   jleek || jhsph dot edu \n\t\t\t  Twitter\n\t\t\t   @leekgroup\n\t\t\t  Blog\n\t\t\t   Simply Statistics"

Then wrap the call above in nchar() and assign the result to an object, here called characters.

> characters <- nchar(xpathSApply(html, "//div[@id = 'main']", xmlValue, trim = TRUE))
> characters
[1] 369

Note that this count includes the tabs and newlines. You can strip them first, perhaps with gsub().
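
A minimal sketch of that cleanup, assuming you want every run of whitespace collapsed to a single space. The string txt is a shortened, hand-copied sample of the output above (not the live page), and the names txt and clean are only illustrative:

```r
# Shortened sample of the xmlValue() output shown above (illustrative only)
txt <- "Contact Information\n\n\t\t\t  Address \n\t\t\t  \n\t\t\t  Johns Hopkins University "

# Collapse each run of whitespace (tabs, newlines, spaces) into one space,
# then drop leading and trailing whitespace
clean <- trimws(gsub("[[:space:]]+", " ", txt))

clean
nchar(clean)
```

Applied to the real result, the same gsub() call can wrap the xpathSApply() expression before nchar(). Whether single spaces between words should count toward the total depends on what you are measuring.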

html

xml

r

extract