Web scraping a table in R only gives the header
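Question: I am trying to scrape #gene_regulation_table from the site below. The browser inspector shows it as a table, but I only get the header and not the actual table contents.
What I have tried: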
library(xml2)
library(rvest)  # provides html_node() and html_table()
library(httr)
library(XML)
url <- 'http://rna.sysu.edu.cn/chipbase/regulator_browse.php?organism=human&assembly=hg38&ref_gene_id=ENSG00000105835.11&gene_symbol=NAMPT#0'
website <- read_html(url)
table_node <- html_node(website, "#gene_regulation_table")
table <- html_table(table_node)
# The exact same problem happens with the XML package
tables <- getNodeSet(htmlParse(url), "//table")
xt <- readHTMLTable(tables[[2]])
So I must be doing something wrong.
Any help is welcome!
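The table on the page is empty in the initial HTML sent by the server. It is then populated by a JavaScript XHR request made by the browser, which returns a JSON string. You can replicate that request with the httr::POST function, but you need to know all the form parameters. For your case, I have put them all in this list: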
form_body <- list(draw = "1", `columns[0][data]` = "protein", `columns[0][name]` = "",
`columns[0][searchable]` = "true", `columns[0][orderable]` = "true",
`columns[0][search][value]` = "", `columns[0][search][regex]` = "false",
`columns[1][data]` = "synonyms", `columns[1][name]` = "",
`columns[1][searchable]` = "true", `columns[1][orderable]` = "true",
`columns[1][search][value]` = "", `columns[1][search][regex]` = "false",
`columns[2][data]` = "protein_full_name", `columns[2][name]` = "",
`columns[2][searchable]` = "true", `columns[2][orderable]` = "true",
`columns[2][search][value]` = "", `columns[2][search][regex]` = "false",
`columns[3][data]` = "upstream_sample_motif_hits", `columns[3][name]` = "",
`columns[3][searchable]` = "true", `columns[3][orderable]` = "true",
`columns[3][search][value]` = "", `columns[3][search][regex]` = "false",
`columns[4][data]` = "downstream_sample_motif_hits", `columns[4][name]` = "",
`columns[4][searchable]` = "true", `columns[4][orderable]` = "true",
`columns[4][search][value]` = "", `columns[4][search][regex]` = "false",
`columns[5][data]` = "upstream_motif", `columns[5][name]` = "",
`columns[5][searchable]` = "true", `columns[5][orderable]` = "true",
`columns[5][search][value]` = "", `columns[5][search][regex]` = "false",
`columns[6][data]` = "downstream_motif", `columns[6][name]` = "",
`columns[6][searchable]` = "true", `columns[6][orderable]` = "true",
`columns[6][search][value]` = "", `columns[6][search][regex]` = "false",
`order[0][column]` = "3", `order[0][dir]` = "desc", `order[1][column]` = "0",
`order[1][dir]` = "asc", start = "0", length = "10", `search[value]` = "",
`search[regex]` = "false", assembly = "hg38", ref_gene_id = "ENSG00000105835.11",
regulator_type = "tf", upstream = "1kb", downstream = "1kb",
motif_status = "Y", sample_flag = "0")
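The `columns[i][...]` entries in the list above follow a regular pattern, so they could also be generated rather than typed out. The following is only a minimal sketch: `build_columns` is a hypothetical helper name (not part of httr or any other package), and it assumes every column uses the same searchable/orderable/search settings, as they do in the list above.

```r
# Hypothetical helper: build the repetitive columns[i][...] form entries
# for a vector of column names. All columns get the same flags, matching
# the hand-written list above.
build_columns <- function(col_names) {
  entries <- list()
  for (i in seq_along(col_names)) {
    prefix <- sprintf("columns[%d]", i - 1L)  # the form indices are zero-based
    entries[[paste0(prefix, "[data]")]]          <- col_names[i]
    entries[[paste0(prefix, "[name]")]]          <- ""
    entries[[paste0(prefix, "[searchable]")]]    <- "true"
    entries[[paste0(prefix, "[orderable]")]]     <- "true"
    entries[[paste0(prefix, "[search][value]")]] <- ""
    entries[[paste0(prefix, "[search][regex]")]] <- "false"
  }
  entries
}

# The seven column names used in the list above
col_names <- c("protein", "synonyms", "protein_full_name",
               "upstream_sample_motif_hits", "downstream_sample_motif_hits",
               "upstream_motif", "downstream_motif")
column_entries <- build_columns(col_names)
```

The result can be combined with the remaining parameters using `c()`, for example `c(list(draw = "1"), column_entries, list(start = "0", length = "10"))`, followed by the other fields from the list above.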
So now you can do:
form_url <- "http://rna.sysu.edu.cn/chipbase/php/get_gene_search_symbol_info.php"
result_json <- httr::content(httr::POST(form_url, body = form_body), "text")
And it is easy to parse the JSON with an R package such as jsonlite to get a nice data frame containing all the information you want:
df <- jsonlite::fromJSON(result_json)
dplyr::as_tibble(df$data)
#> # A tibble: 10 x 7
#> protein synonyms protein_full_na~ upstream_sample~ downstream_samp~ upstream_motif
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 FOXA1 HNF3A, ~ forkhead box A1 2 0 2
#> 2 HNF4A FRTS4, ~ hepatocyte nucl~ 2 0 2
#> 3 BARHL1 - BarH-like homeo~ 0 1 0
#> 4 BHLHE40 BHLHB2,~ basic helix-loo~ 0 1 0
#> 5 CAMTA2 - calmodulin bind~ 0 1 0
#> 6 CDX2 CDX-3, ~ caudal type hom~ 0 1 0
#> 7 CREB1 CREB cAMP responsive~ 0 2 0
#> 8 CTCF MRD21 CCCTC-binding f~ 0 16 0
#> 9 E2F1 E2F-1, ~ E2F transcripti~ 0 1 0
#> 10 E2F3 E2F-3 E2F transcripti~ 0 1 0
#> # ... with 1 more variable: downstream_motif <chr>
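The request above asks for only the first 10 rows (`start = "0"`, `length = "10"`). If more rows are needed, one option might be to change those two parameters. The sketch below assumes the endpoint accepts other `start`/`length` values, which has not been verified here; `page_body` and `fetch_page` are hypothetical helper names.

```r
# Hypothetical helper: return a copy of a form body that requests
# page `page` (1-based) with `page_size` rows per page.
page_body <- function(body, page, page_size = 10L) {
  offset <- as.integer((page - 1L) * page_size)
  modifyList(body, list(
    start  = as.character(offset),
    length = as.character(as.integer(page_size))
  ))
}

# Hypothetical wrapper around the POST and JSON parsing steps shown above.
# Assumes the endpoint honours other start/length values (not verified).
fetch_page <- function(form_url, body, page, page_size = 10L) {
  response <- httr::POST(form_url, body = page_body(body, page, page_size))
  httr::stop_for_status(response)  # fail early on an HTTP error
  parsed <- jsonlite::fromJSON(httr::content(response, "text"))
  parsed$data
}
```

Under those assumptions, `fetch_page(form_url, form_body, page = 2)` would request rows 11 to 20.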