Scraping JavaScript-rendered objects using rvest

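I am trying to scrape JavaScript-rendered objects from a web page. I tried the JIRA API as suggested, but it did not return the activity log. I found a site that explains how to scrape JavaScript-rendered content; see, for example: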

https://datascienceplus.com/scraping-javascript-rendered-web-content-using-r/

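I followed the example, but I am finding it hard to work out what I need to pass as the xpath in order to list the activity log. I am trying to scrape the activity log under the "All" tab container at the bottom of the page.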

library(rvest)
library(V8)

# URL with js-rendered content to be scraped
link <- 'https://issues.apache.org/jira/browse/AMQCPP-645'

# Read the html page content and extract all javascript codes that are inside a list
# html <- getURL(link, followlocation = TRUE)
emailjs <- read_html(link) %>% html_nodes(xpath = "//div") %>% html_text()

ct <- v8()
# Parse the html content from the js output and print it as text
read_html(ct$eval(gsub('document.write', '', emailjs))) %>%
  html_text()

I would like to get output like this:

rows  emailjs
1     S A created issue - 25/Apr/19 15:48 Highlight in document.
2     Justin Bertram made changes - 25/Apr/19 17:53 Field Original Value New Value Comment [ I'm using Firefox, and it's working no problem. It's just HTML so there shouldn't be any browser compatibility issues. My guess is that Firefox is holding on to an older, cached version or something. Try opening a "private browsing" window and trying it from there. ] Highlight in document.
3     Timothy Bish made changes - 25/Apr/19 18:10 Resolution Fixed [ 1 ] Status Open [ 1 ] Closed [ 6 ] Highlight in document.
4     Timothy Bish made transition - 25/Apr/19 18:10 Open Closed 2h 22m 1

Any suggestions would be appreciated. Thanks!

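You can mimic the POST request the page makes and add the required header, then parse the HTML response for the content you need. You may need to do some more string tidying.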

library(httr)
library(rvest)
library(magrittr)

# Header that marks the request as an AJAX call, as the page's own request does
headers <- c('X-Requested-With' = 'XMLHttpRequest')

data = '[{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel","tabPosition":1},"timeDelta":-4904},{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel","tabPosition":0},"timeDelta":-4178}]'

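# POST to the "All" tab panel URL, select the activity entries in the returned HTML,
# and collapse runs of whitespace (including newlines) into a single space.
# Note the doubled backslash: '\s' on its own is not a valid escape in an R string.
rows <- read_html(httr::POST(url = 'https://issues.apache.org/jira/browse/AMQCPP-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&_=1570029676497', httr::add_headers(.headers = headers), body = data)) %>%
        html_nodes('.issuePanelWrapper .issue-data-block') %>%
        html_text() %>%
        gsub('\\s+', ' ', .)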
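
As for the extra string tidying, here is a minimal sketch of one way to get from a character vector like rows to the two-column layout asked for in the question. It runs offline on a made-up raw_rows vector (hypothetical sample strings that only mimic what html_text() might return, not real output from the page), and tidy_text is a helper name introduced here for illustration.

```r
# Hypothetical sample of raw html_text() output; NOT real data from the page
raw_rows <- c(
  "\n   S A created issue - \n 25/Apr/19 15:48   ",
  "\n Timothy Bish made transition -\n\n 25/Apr/19 18:10  Open   Closed "
)

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then strip the leading and trailing space that is left over
tidy_text <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# One row per activity entry, numbered like the desired output in the question
activity <- data.frame(
  rows = seq_along(raw_rows),
  emailjs = tidy_text(raw_rows),
  stringsAsFactors = FALSE
)

print(activity)
```

With the real response you would pass rows in place of raw_rows. Whether further cleanup is needed (for example dropping the trailing "Highlight in document." text) depends on what the page actually returns.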