Scraping data from a GroupGrid
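I'd like to scrape and analyze the inputs to a when2meet table.
Here's an example: http://www.when2meet.com/?4474391-IBuBA
The table gives a quick visual of each group member's availability; I'd like to extract that into R for some analysis, but I'm failing to do so.
What I've tried is in fact very short; I've only extracted the main page element. The output is (to me) gibberish: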
library(rvest)
url <- "http://www.when2meet.com/?4474391-IBuBA"
# read_html() replaces html(), which was deprecated in later rvest releases
grid <- read_html(url) %>% html_nodes(xpath = '//*[@id="GroupGrid"]')
grid
Which looks like this:
<div style="font-size:0px;vertical-align:top;"><div id="GroupTime279816300" onmouseover="ShowSlot(279816300);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #c5e2b6;"><script><![CDATA[
Col[TimeOfSlot.indexOf(279816300)] = 0;
Row[TimeOfSlot.indexOf(279816300)] = 23;
]]></script></div>
<div id="GroupTime279902700" onmouseover="ShowSlot(279902700);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #8ac56d;"><script><![CDATA[
Col[TimeOfSlot.indexOf(279902700)] = 1;
Row[TimeOfSlot.indexOf(279902700)] = 23;
]]></script></div>
<div id="GroupTime279989100" onmouseover="ShowSlot(279989100);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #c5e2b6;"><script><![CDATA[
Col[TimeOfSlot.indexOf(279989100)] = 2;
Row[TimeOfSlot.indexOf(279989100)] = 23;
]]></script>
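I see basically nothing useful to me here; it may as well be Urdu. And I haven't been able to find anything on Google or SO about scraping GroupGrid tables.
Does anyone have any idea how to proceed?
Ideally, my output would be a data.table (a data.frame, if need be) of the form: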
output
# id slot available
# 1: user_1 M 9:00 TRUE
# 2: user_1 T 9:30 FALSE
# 3: user_1 W 10:00 TRUE
# 4: user_1 R 10:30 TRUE
# 5: user_2 M 9:00 TRUE
# 6: user_2 T 9:30 FALSE
# 7: user_2 W 10:00 TRUE
# 8: user_2 R 10:30 FALSE
(The exact format of the slot column is not important, nor does it need to be one column; if it's easier, it can be day and time.)
You could do something like this:
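library(rvest)
library(data.table)

# The availability data lives in the page's inline JavaScript, not in the
# GroupGrid markup, so grab the <script> that defines PeopleNames.
# read_html() replaces html(), which was deprecated in later rvest releases.
script <- read_html("http://www.when2meet.com/?4474391-IBuBA") %>%
  html_nodes("script:contains('PeopleNames')") %>% html_text()

# Return one row per match of `regex` and one column (V1, V2, ...) per capture group
f <- function(regex) {
  m <- regmatches(script, gregexpr(regex, script))[[1]]
  #faster than transposing with `t`
  setDT(transpose(lapply(regmatches(m, regexec(regex, m)), "[", -1)))[]
}

# Backslashes must be doubled inside R string literals
slots <- f("TimeOfSlot\\[(\\d+)\\]=(\\d+);")
users <- f("PeopleNames\\[(\\d+)\\] = '([^']+)';PeopleIDs\\[\\d+\\] = (\\d+);")
avails <- f("AvailableAtSlot\\[(\\d+)]\\.push\\((\\d+)\\);")

# One row per person ID (V2) and slot index, TRUE where the person marked the slot
DT <- melt(dcast(avails, V2 ~ V1,
                 fun.aggregate = function(x) length(x) > 0,
                 value.var = "V2"),
           id.vars = "V2",
           variable.name = "timeslot", value.name = "available")
# Attach user names by person ID
DT[users, id := i.V2, on = c(V2 = "V3")]
# Convert each slot's epoch seconds to a "Day HH:MM" label
DT[slots, time := format(as.POSIXct(as.integer(i.V2), origin = "1970-01-01", tz = "GMT"), "%a %H:%M"),
   on = c(timeslot = "V1")]
DT[ , c("V2", "timeslot") := NULL]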
DT[time == "Mon 11:00" & available]
# available id time
# 1: TRUE user_1 Mon 11:00
# 2: TRUE user_2 Mon 11:00
# 3: TRUE user_3 Mon 11:00
# 4: TRUE user_4 Mon 11:00
# 5: TRUE user_5 Mon 11:00
# 6: TRUE user_7 Mon 11:00
# 7: TRUE user_10 Mon 11:00
DT[time == "Mon 11:00" & !available]
# available id time
# 1: FALSE user_6 Mon 11:00
# 2: FALSE user_8 Mon 11:00
# 3: FALSE user_9 Mon 11:00
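If you want to see what the regex step is doing without hitting the site, here is a minimal offline sketch. The `script` string below is a made-up excerpt shaped like the statements the patterns above target (the person IDs 101 and 102 and the second timestamp are invented for illustration), and it uses base R in place of data.table, so treat it as an illustration of the idea rather than a drop-in replacement. The helper name `extract` is hypothetical. It also splits the slot into separate `day` and `time` columns, as the question allows.

```r
# Hypothetical excerpt of the page's inline JavaScript (invented for illustration;
# the real page contains many more entries).
script <- paste0(
  "TimeOfSlot[0]=279816300;TimeOfSlot[1]=279817200;",
  "PeopleNames[0] = 'user_1';PeopleIDs[0] = 101;",
  "PeopleNames[1] = 'user_2';PeopleIDs[1] = 102;",
  "AvailableAtSlot[0].push(101);AvailableAtSlot[0].push(102);",
  "AvailableAtSlot[1].push(101);"
)

# Return a data.frame with one row per match and one column per capture group.
extract <- function(regex, text) {
  m <- regmatches(text, gregexpr(regex, text))[[1]]            # every full match
  groups <- lapply(regmatches(m, regexec(regex, m)), `[`, -1)  # keep capture groups only
  as.data.frame(do.call(rbind, groups), stringsAsFactors = FALSE)
}

slots  <- extract("TimeOfSlot\\[(\\d+)\\]=(\\d+);", script)
users  <- extract("PeopleNames\\[(\\d+)\\] = '([^']+)';PeopleIDs\\[\\d+\\] = (\\d+);", script)
avails <- extract("AvailableAtSlot\\[(\\d+)\\]\\.push\\((\\d+)\\);", script)
names(slots)  <- c("slot", "epoch")
names(users)  <- c("idx", "id", "person")
names(avails) <- c("slot", "person")

# Every person/slot combination; a pair is available if it appears in `avails`.
grid <- expand.grid(person = users$person, slot = slots$slot,
                    stringsAsFactors = FALSE)
grid$available <- paste(grid$person, grid$slot) %in%
  paste(avails$person, avails$slot)

# Look up the user name and convert the slot's epoch seconds to day and time.
grid$id <- users$id[match(grid$person, users$person)]
epoch <- as.integer(slots$epoch[match(grid$slot, slots$slot)])
when <- as.POSIXct(epoch, origin = "1970-01-01", tz = "GMT")
grid$day  <- format(when, "%a")     # abbreviated weekday, locale-dependent
grid$time <- format(when, "%H:%M")

result <- grid[, c("id", "day", "time", "available")]
result
```

With this sample, `result` has four rows (two users by two slots), and the only `FALSE` is user_2 in the second slot, because no `AvailableAtSlot[1].push(102)` statement appears in the excerpt.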