go-colly: How can I get the HTML title in c.OnResponse, so I can fill the struct?
How can I get the HTML title in c.OnResponse? Or is there a better way to fill a struct with url/title/content?
In the end I need to fill the struct below and POST it to Elasticsearch.
type WebPage struct {
Url string `json:"url"`
Title string `json:"title"`
Content string `json:"content"`
}
// Print the response
c.OnResponse(func(r *colly.Response) {
pageCount++
log.Println(r.Headers)
webpage := WebPage{
Url: r.Ctx.Get("url"), //- can be put in ctx c.OnRequest, and r.Ctx.Get("url")
Title: "my title", //string(r.title), // Where to get this?
Content: string(r.Body), //string(r.Body) - can be done in c.OnResponse
}
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
enc.Encode(webpage) // SEND it to elasticsearch
log.Println(fmt.Sprintf("%d DONE Visiting : %s", pageCount, urlVisited))
})
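I can get the title with the callback below, but Ctx is not available on the element, so I cannot put the "title" value into Ctx. Are there other options?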
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Println(e.Text)
e.Ctx.Put("title", e.Text) // NOT ACCESSIBLE!
})
Log
2020/05/07 17:42:37 7 DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
"url": "https://www.coursera.org/learn/build-portfolio-website-html-css",
"title": "my page title",
"content": "page html body bla "
}
2020/05/07 17:42:37 8 DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
"url": "https://www.coursera.org/browse/social-sciences",
"title": "my page title",
"content": "page html body bla "
}
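I created a global variable of that struct and keep filling it from the different callbacks.
I am not sure whether this is the best approach.
func main() {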
....
webpage := WebPage{} //Is this a right way to declare a mutable struct?
c.OnRequest(func(r *colly.Request) { // url
webpage.Url = r.URL.String() // Is this the right way to mutate?
})
c.OnResponse(func(r *colly.Response) { //get body
pageCount++
log.Println(fmt.Sprintf("%d DONE Visiting : %s", pageCount, webpage.Url))
})
c.OnHTML("head title", func(e *colly.HTMLElement) { // Title
webpage.Title = e.Text
})
c.OnHTML("html body", func(e *colly.HTMLElement) { // Body / content
webpage.Content = e.Text // Can url title body be misrepresented in multithread scenario?
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) { // href , callback
link := e.Attr("href")
e.Request.Visit(link)
})
c.OnError(func(r *colly.Response, err error) { // Set error handler
log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})
c.OnScraped(func(r *colly.Response) { // DONE
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
enc.Encode(webpage)
})
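On the question of whether url/title/body can get mixed up: with one shared `webpage` variable, any callback belonging to another request can overwrite fields before the struct is encoded. This seems likely once the collector runs asynchronously, and possibly even without that, because `e.Request.Visit` is called from inside an `OnHTML` callback. The log above already shows entry 8 with one URL in the log line and another in the JSON. A standard-library-only sketch of the general remedy follows: each unit of work builds its own value and hands the finished struct to a single consumer. The function `buildPages` and its input map are hypothetical stand-ins for the crawler callbacks, not colly API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// WebPage mirrors the struct from the question.
type WebPage struct {
	Url     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// buildPages simulates concurrent page handling. Each goroutine fills a
// WebPage that is local to it, so fields from different URLs cannot mix.
func buildPages(titles map[string]string) []WebPage {
	results := make(chan WebPage)
	var wg sync.WaitGroup

	for u, t := range titles {
		wg.Add(1)
		go func(u, t string) {
			defer wg.Done()
			page := WebPage{Url: u, Title: t} // per-request value, never shared
			results <- page
		}(u, t)
	}

	// Close the channel once every worker has delivered its page.
	go func() {
		wg.Wait()
		close(results)
	}()

	var pages []WebPage
	for p := range results {
		pages = append(pages, p)
	}
	// Sort only to make the output order deterministic.
	sort.Slice(pages, func(i, j int) bool { return pages[i].Url < pages[j].Url })
	return pages
}

func main() {
	pages := buildPages(map[string]string{
		"https://example.com/a": "Page A",
		"https://example.com/b": "Page B",
	})
	for _, p := range pages {
		fmt.Printf("%s -> %s\n", p.Url, p.Title)
	}
}
```

The same idea carries over to colly by keeping per-page values in the request context instead of in a variable shared by all callbacks.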
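My solution is based on Espresso's answer...
I simply grab the whole html element in one callback and then query the head and body inside it, so everything works and is encapsulated in a single "c.OnHTML".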
c2.OnHTML("html", func(html *colly.HTMLElement) {
slug := strings.Split(html.Request.URL.String(), "/")[4]
title := ""
descr := ""
h1 := ""
html.ForEach("head", func(_ int, head *colly.HTMLElement) {
title += head.ChildText("title")
head.ForEach("meta", func(_ int, meta *colly.HTMLElement) {
if meta.Attr("name") == "description" {
descr += meta.Attr("content")
}
})
})
html.ForEach("h1", func(_ int, h1El *colly.HTMLElement) {
h1 += h1El.Text
})
//Now you can do stuff with your elements from head and body
})
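One detail in the snippet above is worth guarding: `strings.Split(url, "/")[4]` panics with an index-out-of-range error for any URL that has fewer than two path segments, such as a site's home page. A minimal sketch of a safer lookup follows, assuming the slug is always the second path segment, as in the Coursera URLs from the log. The helper name `pathSegment` is made up for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// pathSegment returns the i-th (zero-based) non-empty path segment of rawURL.
// The boolean is false when the URL cannot be parsed or has too few segments.
func pathSegment(rawURL string, i int) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	// FieldsFunc drops empty pieces, so leading, trailing, and doubled slashes are ignored.
	parts := strings.FieldsFunc(u.Path, func(r rune) bool { return r == '/' })
	if i < 0 || i >= len(parts) {
		return "", false
	}
	return parts[i], true
}

func main() {
	slug, ok := pathSegment("https://www.coursera.org/learn/build-portfolio-website-html-css", 1)
	fmt.Println(slug, ok) // prints: build-portfolio-website-html-css true

	_, ok = pathSegment("https://www.coursera.org/", 1)
	fmt.Println(ok) // prints: false
}
```

Inside the callback this would be called as `pathSegment(html.Request.URL.String(), 1)`, skipping the page when the second return value is false.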