How can I crawl/scrape (using R) the non-table EPA CompTox Dashboard?

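The EPA CompTox Chemicals Dashboard received an update, and my old code can no longer scrape the boiling point for chemicals. Can anyone help me scrape the experimental average boiling point? I need to be able to write R code that loops over multiple chemicals.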

Example web pages:
Acetone: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482
Methane: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545

I've tried read_html() and xmlParse() without success. The experimental average boiling point (ExpAvBP) value does not show up in the XML.

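I tried using ContentScraper() from Rcrawler, but no matter what I try it just returns NA. In addition, this would only work for the first web page listed, because the cell ID changes with each chemical.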

ContentScraper(Url="https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482", XpathPatterns = "//*[@id='cell-225']")

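I tried using readLines(), but the information is all crammed into the last script tag, and I'm not sure how to isolate just the ExpAvBP value. It looks like the values are stored elsewhere? For example, below is what I believe is the boiling point information from the last script tag.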

Acetone:

{unit:c_,name:"Boiling Point",predicted:{rawData:[{value:c$,minValue:e,maxValue:e,source:am,description:an,modelName:"TEST_BP",modelId:T,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:B,link:"https:\u002F\u002Fs3.amazonaws.com\u002Fepa-comptox\u002Ftest-reports\u002FDTXCID101482-TEST_BP.html",showLink:a},qmrf:{value:e,link:e,showLink:d}},{value:44.8,minValue:e,maxValue:e,source:ci,description:cj,modelName:"EPISUITE_BP",modelId:dV,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:M,link:e,showLink:d},qmrf:{value:e,link:e,showLink:d}},{value:46.458,minValue:e,maxValue:e,source:ad,description:V,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:M,link:e,showLink:d},qmrf:{value:e,link:e,showLink:d}},{value:da,minValue:e,maxValue:e,source:aL,description:bo,modelName:"OPERA_BP",modelId:dS,hasOpera:a,globalApplicability:q,hasQmrfPdf:a,details:{value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=21482",showLink:a},qmrf:{value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a}}],count:bu,mean:47.06289999999999,min:c$,max:da,range:[c$,da],median:45.629},experimental:{rawData:[{value:db,minValue:e,maxValue:e,source:aN,description:aO,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:"Food and Agriculture Organization of the United Nations",description:"The Joint FAO\u002FWHO Expert Committee on Food Additives (JECFA) is an international expert scientific committee that is administered jointly by the Food and Agriculture Organization of the United Nations (FAO) and the World Health Organization (WHO). 
Website: \u003Ca href="http:\u002F\u002Fwww.fao.org\u002Fhome\u002F" target="_blank"\u003Ehttp:\u002F\u002Fwww.fao.org\u002Fhome\u002F\u003C\u002Fa\u003E",experimentalDetails:[]},{value:56.05,minValue:e,maxValue:e,source:"Abooali et al. Int. J. Refrig. 2014, 40, 282–293",description:"Abooali, D.; Sobati, M. A. Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach. (\u003Ca href="http:\u002F\u002Fdx.doi.org\u002F10.1016\u002Fj.ijrefrig.2013.12.007" target="_blank"\u003EInt. J. Refrig. 2014, 40, 282–293\u003C\u002Fa\u003E)\r\n",experimentalDetails:[]},{value:bO,minValue:bO,maxValue:bO,source:hI,description:hJ,experimentalDetails:[]}],count:dK,mean:55.98518333333333,min:db,max:bO,range:[db,bO],median:ak},arrKey:"BOILING_POINT"}

Methane:

{unit:cO,name:"Boiling Point",predicted:{rawData:[{value:at,minValue:f,maxValue:f,source:bB,description:bb,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:{value:ag,link:f,showLink:d},qmrf:{value:f,link:f,showLink:d}},{value:hl,minValue:f,maxValue:f,source:aF,description:ba,modelName:"OPERA_BP",modelId:dv,hasOpera:a,globalApplicability:s,hasQmrfPdf:a,details:{value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=25545",showLink:a},qmrf:{value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a}},{value:cP,minValue:f,maxValue:f,source:bZ,description:b_,modelName:"EPISUITE_BP",modelId:dy,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:{value:ag,link:f,showLink:d},qmrf:{value:f,link:f,showLink:d}}],count:bH,mean:-129.25300000000001,min:at,max:cP,range:[at,cP],median:hl},experimental:{rawData:[{value:at,minValue:at,maxValue:at,source:hm,description:hn,experimentalDetails:[]},{value:cQ,minValue:f,maxValue:f,source:bC,description:bD,experimentalDetails:[]}],count:H,mean:ho,min:at,max:cQ,range:[at,cQ],median:ho},arrKey:"BOILING_POINT"}

Any help or insight would be greatly appreciated!

Since the data is not in table format, we have to extract the page text and match the pattern BoilingPoint.

Extracting the boiling point
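library(rvest)
library(dplyr)
library(stringr)    # provides str_remove_all() and str_replace_all() used further down
library(RSelenium)

url <- 'https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545'

# Start a Selenium-driven Firefox session and load the page so its JavaScript runs
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

# Parse the rendered HTML and pull the text of the node that holds the properties
df <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="__layout"]/div/div[5]/div[2]/main/div/div[3]/div[2]/div/div[2]/div[2]/div[3]') %>%
  html_text()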
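
Because the question asks about several chemicals, the navigation step can be wrapped so it runs once per DTXSID. The following is only a sketch: `collect_text()` and `fetch_text` are hypothetical names, not part of RSelenium or rvest, and it assumes the same XPath holds on every chemical's page, which has not been checked. A stand-in fetcher is used here so the block runs without a browser.

```r
# DTXSIDs taken from the two example pages in the question.
dtxsids  <- c(acetone = "DTXSID8021482", methane = "DTXSID8025545")
base_url <- "https://comptox.epa.gov/dashboard/chemical/properties/"

# Hypothetical helper: calls fetch_text(url) once per chemical and returns
# a named character vector, with NA where the fetch raised an error.
collect_text <- function(ids, fetch_text, pause = 0) {
  vapply(ids, function(id) {
    Sys.sleep(pause)  # optional delay between page loads
    tryCatch(
      paste(fetch_text(paste0(base_url, id)), collapse = " "),
      error = function(e) NA_character_
    )
  }, character(1))
}

# Stand-in fetcher for demonstration only. With a live session, fetch_text
# would instead navigate remDr to the URL and return the html_text() result.
fake_fetch <- function(url) paste("page text for", url)

page_text <- collect_text(dtxsids, fake_fetch)
print(names(page_text))
```

With a real browser session, a non-zero `pause` may be needed so the page has time to render before its source is read.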

Now extract the boiling point from the text.

# Strip newlines and spaces, then capture the first run of digits after "BoilingPoint"
df1 <- df %>% str_remove_all('\n') %>% str_replace_all(' ', '')
as.numeric(sub(".*?BoilingPoint.*?(\\d+).*", "\\1", df1))
[1] 163

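The pattern captures only an unsigned run of digits, so you may need to fine-tune it further to keep the sign and the decimal part of the boiling point.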
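
One possible refinement, as a sketch only: a pattern that also accepts a leading minus sign and a decimal part, and returns NA when nothing matches. `extract_bp()` is a hypothetical helper, and `sample_text` is made-up text standing in for the whitespace-stripped page text, since the real layout of that text is not shown here.

```r
# Made-up stand-ins for df1 (page text with whitespace removed).
sample_text <- c(
  methane = "PropertyBoilingPointExperimentalAverage-161.5",
  acetone = "PropertyBoilingPointExperimentalAverage56.0"
)

# Hypothetical helper: returns the first signed decimal number that follows
# `label`, or NA if the label or a number is not found.
extract_bp <- function(txt, label = "BoilingPoint") {
  # (?s) lets "." match newlines too; the capture group allows an optional
  # minus sign and an optional fractional part.
  pattern <- paste0("(?s).*?", label, ".*?(-?[0-9]+(\\.[0-9]+)?).*")
  hit <- grepl(pattern, txt, perl = TRUE)
  out <- rep(NA_real_, length(txt))
  out[hit] <- as.numeric(sub(pattern, "\\1", txt[hit], perl = TRUE))
  names(out) <- names(txt)
  out
}

bp <- extract_bp(sample_text)
print(bp)
```

Note that a hyphen directly in front of the digits is always read as a minus sign, so text that uses a hyphen as a separator would need a stricter pattern.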