Webscrape text files using R, rvest or rcurl
So I have a website, https://ais.sbarc.org/logs_delimited/, which contains a bunch of links, and each of those links contains 24 links that hold .txt files.
I'm new to R, but I was able to loop through one link and get the 24 text files into a data frame. What I can't figure out is how to loop through the whole directory.
I can loop through the 24 links using hours.list, but year.list and trip.list don't work...
Apologies if this duplicates other webscraping questions or if I'm missing something really simple, but I'd appreciate any help.
get_ais_text = function(ais_text){
  hours.list = c(0:23)
  hours.list_1 = sprintf('%02d', hours.list)
  year.list = c(2018:2022)
  year.list1 = sprintf('%d', year.list)
  trip.list = c(190101:191016)
  trip.list_1 = sprintf("%d", trip.list)
  ais_text = tryCatch(
    lapply(paste0('https://ais.sbarc.org/logs_delimited/2019/190101/AIS_SBARC_190101-', hours.list_1, '.txt'),
           function(url){
             url %>%
               read_delim(";", col_names = sprintf("X%d", 1:25), col_types = ais_col_types)
           }),
    error = function(e){NA}
  )
  DF = do.call(rbind.data.frame, ais_text)
  return(DF)
}
get_ais_text()
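For reference, the three lists could in principle be combined into a full URL grid with expand.grid, as in the sketch below (hypothetical variable names); note that a plain numeric range like 190101:191016 also produces many values that are not valid YYMMDD dates, so many of the constructed URLs would not exist:

hours  <- sprintf("%02d", 0:23)
trips  <- sprintf("%d", 190101:191016)   # many of these are not real YYMMDD dates
combos <- expand.grid(trip = trips, hour = hours, stringsAsFactors = FALSE)

urls <- paste0("https://ais.sbarc.org/logs_delimited/2019/", combos$trip,
               "/AIS_SBARC_", combos$trip, "-", combos$hour, ".txt")
head(urls)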
This works for me:
library(rvest)
crawler <- function(base_url) {
  get_links <- function(url) {
    read_html(url) %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      grep("../", ., fixed = TRUE, value = TRUE, invert = TRUE) %>%
      url_absolute(url)
  }

  links <- base_url
  counter <- 1
  while (sum(grepl("txt$", links)) != length(links)) {
    links <- unlist(lapply(links, get_links))
    message("scraping level ", counter, " [", length(links), " links]")
    counter <- counter + 1
  }
  return(links)
}
txts <- crawler("https://ais.sbarc.org/logs_delimited/")
Level 3 may look like it's hanging, but that's just because of the sheer number of links it has to work through.
Once you have all the txt URLs, you can use this to read in the files:
library(dplyr)
library(data.table)
df <- lapply(txts, fread, fill = TRUE) %>%
  rbindlist() %>%
  as_tibble()
I would definitely do this in two steps, since it runs for quite a while and it makes sense to save the intermediate result (i.e., the links). If you want, you can also try running it in parallel (cl is the number of cores to use):
library(pbapply)
df <- pblapply(txts[1:10], fread, fill = TRUE, cl = 3) %>%
  rbindlist() %>%
  as_tibble()
It should be a bit faster, and you also get a nice progress bar.
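A minimal sketch of that two-step workflow, reusing the crawler() function and the libraries loaded above (the .rds file name is just an example):

# Step 1: crawl once and cache the link list on disk
txts <- crawler("https://ais.sbarc.org/logs_delimited/")
saveRDS(txts, "ais_txt_links.rds")

# Step 2: in a later session, reload the cached links and read the files
txts <- readRDS("ais_txt_links.rds")
df <- lapply(txts, fread, fill = TRUE) %>%
  rbindlist() %>%
  as_tibble()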
Here is a function that works recursively to get all the links, starting from the main directory. Be aware that it takes a while to run:
library(xml2)
library(magrittr)
.get_link <- function(u){
  node <- xml2::read_html(u)
  hrefs <- xml2::xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml_attr("href")
  urls <- xml2::url_absolute(hrefs, xml_url(node))
  if(!all(tools::file_ext(urls) == "txt")){
    lapply(urls, .get_link)
  } else {
    return(urls)
  }
}
This basically starts from a url, reads the contents, and uses an xpath selector to find any <a ...> links; the selector says "all links that are not ../", i.e. ... not the link back up to the parent directory. Then, if a link contains more links, it iterates and collects all of those. Once we reach the final links, i.e. the .txt files, we're done.
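To see what that selector returns on a single directory level, a quick standalone check along these lines may help (a sketch using the same xml2 calls as the function above):

library(xml2)
library(magrittr)

node  <- read_html("https://ais.sbarc.org/logs_delimited/")
hrefs <- xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml_attr("href")
url_absolute(hrefs, xml_url(node))  # absolute URLs for the entries at this level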
The example cheats and starts only at 2018:
a <- .get_link("https://ais.sbarc.org/logs_delimited/2018/")
> a[[1]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-01.txt"
> length(a)
[1] 365
> a[[365]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-01.txt"
All you have to do is start with https://ais.sbarc.org/logs_delimited/ as the url input, then add something like data.table::fread to digest the data. I'd suggest doing that in a separate iteration. Something like this works:
lapply(1:length(a), function(i){
  lapply(a[[i]], data.table::fread)
})
for reading in the data... The first thing to note here is that there are 11,636 files. That's a lot of links to hit someone's server with all at once... so I'll sample a few and show how to do it. I'd suggest adding a Sys.sleep call to yours...
# This gets all the urls
a <- .get_link("https://ais.sbarc.org/logs_delimited/")
# This unlists and gives us a unique array of the urls
b <- unique(unlist(a))
# I'm sampling b, but you would just use `b` instead of `b[...]`
a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
  df <- data.table::fread(i, sep = ";") %>% as.data.frame()
  # Giving the file path for debug later if needed seems helpful
  df$file_path <- i
  df
}))
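Following the Sys.sleep suggestion, a sketch of how a pause could be added between requests (the 1-second delay is an arbitrary choice):

a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
  Sys.sleep(1)  # pause between requests so the server isn't hammered
  df <- data.table::fread(i, sep = ";") %>% as.data.frame()
  df$file_path <- i
  df
}))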
> a_dfs %>% head()
17:00:00:165 24 0 338179477 LAUREN SEA V8 V9 V15 V16 V17 V18 V19 V20 V21 V22 V23 file_path V1 V2 V3 V4
1 17:00:00:166 EUPHONY ACE 79 71.08 1 371618000 0 254.0 253 52 0 0 0 0 5 NA https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
2 17:00:01:607 SIMONE T BRUSCO 31 32.93 3 367593050 15 255.7 97 55 0 0 1 0 503 0 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
3 17:00:01:626 POLARIS VOYAGER 89 148.80 1 311000112 0 150.0 151 53 0 0 0 0 0 22 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
4 17:00:01:631 SPECTRE 60 25.31 1 367315630 5 265.1 511 55 0 0 1 0 2 20 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
5 17:00:01:650 KEN EI 70 73.97 1 354162000 0 269.0 269 38 0 0 0 0 1 84 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
6 17:00:02:866 HANNOVER BRIDGE 70 62.17 1 372104000 0 301.1 300 56 0 0 0 0 3 1 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
V5 V6 V7 V10 V11 V12 V13 V14 02:00:00:489 338115994 1 37 SRTG0$ 10 7 4 17:00:00:798 BROADBILL 16.84 269 18 367077090 16.3 -119.981493 34.402530 264.3 511 40
1 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA <NA> NA NA NA NA NA NA NA <NA> <NA> NA NA <NA> <NA> NA NA NA NA NA NA
Obviously some cleanup is needed... but I think this is how you'd get there.
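As a rough sketch of what that cleanup could look like (an illustrative assumption, not part of the answer; the generic V* columns would still need to be mapped to the real AIS fields):

library(dplyr)

cleaned <- a_dfs %>%
  # drop columns that came back entirely empty
  select(where(~ !all(is.na(.x)))) %>%
  # guess sensible types for the character columns (numeric columns are left alone)
  mutate(across(where(is.character), ~ type.convert(.x, as.is = TRUE)))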
Edit 2
Actually, I like this better: read the data in, then split the strings and build the full data frame:
a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
  raw <- readLines(i)
  str_matrix <- stringi::stri_split_regex(raw, "\\;", simplify = TRUE)
  as.data.frame(apply(str_matrix, 2, function(j){
    ifelse(!nchar(j), NA, j)
  })) %>% dplyr::mutate(file_name = i)
}))
> a_dfs %>% head
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
1 09:59:57:746 STAR CARE 77 75.93 135 1 0 566341000 0 0 16.7 1 -118.839933 33.562167 321 322 50 0 0 0 0 6 19 <NA> <NA>
2 10:00:00:894 THALATTA 70 27.93 133.8 1 0 229710000 0 251 17.7 1 -119.366765 34.101742 283.9 282 55 0 0 0 0 7 <NA> <NA> <NA>
3 10:00:03:778 GULF GLORY 82 582.3 256 1 0 538007706 0 0 12.4 0 -129.345783 32.005983 87 86 54 0 0 0 0 2 20 <NA> <NA>
4 10:00:03:799 MAGPIE SW 70 68.59 123.4 1 0 352597000 0 0 10.9 0 -118.747970 33.789747 119.6 117 56 0 0 0 0 0 22 <NA> <NA>
5 10:00:09:152 CSL TECUMSEH 70 66.16 269.7 1 0 311056900 0 11 12 1 -120.846763 34.401482 105.8 106 56 0 0 0 0 6 21 <NA> <NA>
6 10:00:12:870 RANGER 85 60 31.39 117.9 1 0 367044250 0 128 0 1 -119.223133 34.162953 360 511 56 0 0 1 0 2 21 <NA> <NA>
file_name V26 V27
1 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
2 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
3 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
4 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
5 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
6 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>