go-colly: How can I get the HTML title in c.OnResponse, so I can fill the struct?

Question

How can I get the HTML title in c.OnResponse, or is there a better way to fill the struct with url/title/content?

In the end I need to fill the struct below and POST it to Elasticsearch.

type WebPage struct {
    Url     string `json:"url"`
    Title   string `json:"title"`
    Content string `json:"content"`
}

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        pageCount++
        log.Println(r.Headers)

        webpage := WebPage{
            Url:     r.Ctx.Get("url"), // can be put into the Ctx in c.OnRequest and read back with r.Ctx.Get("url")
            Title:   "my title",       // string(r.title)? Where do I get this?
            Content: string(r.Body),   // available here in c.OnResponse
        }

        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", "  ")
        enc.Encode(webpage) // SEND it to elasticsearch

        log.Printf("%d  DONE Visiting : %s", pageCount, urlVisited)
    })

I can get the title with the handler below, but Ctx is not accessible there, so I cannot put the "title" value into the Ctx. Are there other options?

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
        e.Ctx.Put("title", e.Text) // NOT ACCESSIBLE!
    })
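
One option that stays entirely inside c.OnResponse is to pull the title out of the raw body. The following is a minimal sketch using only the standard library; extractTitle and titleRe are hypothetical names, not part of colly, and a regular expression is only an approximation of real HTML parsing. In a handler it would be used as Title: extractTitle(r.Body).

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// titleRe matches the first <title>...</title> pair, case-insensitively,
// and lets "." span newlines. It is an approximation, not an HTML parser.
var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// extractTitle returns the text of the first <title> element in body,
// or "" when none is found. Hypothetical helper, not a colly function.
func extractTitle(body []byte) string {
	m := titleRe.FindSubmatch(body)
	if m == nil {
		return ""
	}
	// Decode entities such as &amp; and collapse runs of whitespace.
	text := html.UnescapeString(string(m[1]))
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	body := []byte("<html><head><TITLE>\n  Build a Portfolio &amp; More </TITLE></head><body>hi</body></html>")
	fmt.Println(extractTitle(body))
}
```

Separately, colly's HTMLElement has no Ctx field of its own; the request context is reachable as e.Request.Ctx, so e.Request.Ctx.Put("title", e.Text) inside c.OnHTML and r.Ctx.Get("title") inside c.OnScraped is another route worth trying.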

Log

2020/05/07 17:42:37 7  DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
  "url": "https://www.coursera.org/learn/build-portfolio-website-html-css",
  "title": "my page title",
  "content": "page html body bla "
}
2020/05/07 17:42:37 8  DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
  "url": "https://www.coursera.org/browse/social-sciences",
  "title": "my page title",
  "content": "page html body bla "
}

Answer 1

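I created a global variable of that struct and keep filling it from the different callbacks.

I am not sure whether this is the best approach.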


func main() {
    ....

    webpage := WebPage{} // Is this the right way to declare a mutable struct?

    c.OnRequest(func(r *colly.Request) { // URL
        webpage.Url = r.URL.String() // Is this the right way to mutate?
    })

    c.OnResponse(func(r *colly.Response) { // count the visited page
        pageCount++
        log.Printf("%d  DONE Visiting : %s", pageCount, webpage.Url)
    })

    c.OnHTML("head title", func(e *colly.HTMLElement) { // title
        webpage.Title = e.Text
    })

    c.OnHTML("html body", func(e *colly.HTMLElement) { // body / content
        webpage.Content = e.Text // Can url, title and body get mixed up in a multithreaded scenario?
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) { // follow links
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.OnError(func(r *colly.Response, err error) { // error handler
        log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    c.OnScraped(func(r *colly.Response) { // page done: emit the struct
        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", "  ")
        enc.Encode(webpage)
    })
}
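
On the question of whether url, title and body can get mixed up: with one shared struct they probably can. With an async collector the callbacks of different pages may interleave, and as far as I understand colly, even a synchronous e.Request.Visit call runs the child page to completion inside the parent's callback, so the parent's values may be overwritten before its OnScraped fires. A minimal sketch of one way around this is to keep one record per URL behind a mutex. pageStore, update and take are hypothetical names; in the callbacks the key would be e.Request.URL.String() or r.Request.URL.String(), and this assumes each URL is visited only once.

```go
package main

import (
	"fmt"
	"sync"
)

type WebPage struct {
	Url     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// pageStore holds one WebPage per URL so that callbacks belonging to
// different pages never write to the same struct. Hypothetical helper.
type pageStore struct {
	mu    sync.Mutex
	pages map[string]*WebPage
}

func newPageStore() *pageStore {
	return &pageStore{pages: make(map[string]*WebPage)}
}

// update applies fn to the entry for url, creating the entry on first use.
func (s *pageStore) update(url string, fn func(*WebPage)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pages[url]
	if !ok {
		p = &WebPage{Url: url}
		s.pages[url] = p
	}
	fn(p)
}

// take removes and returns the entry for url; ok is false if none exists.
func (s *pageStore) take(url string) (page WebPage, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, found := s.pages[url]
	if !found {
		return WebPage{}, false
	}
	delete(s.pages, url)
	return *p, true
}

func main() {
	store := newPageStore()
	var wg sync.WaitGroup
	// Simulate two pages being processed at the same time.
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			store.update(u, func(p *WebPage) { p.Title = "title of " + u })
			store.update(u, func(p *WebPage) { p.Content = "body of " + u })
		}(u)
	}
	wg.Wait()

	page, _ := store.take("https://example.com/a")
	fmt.Printf("%+v\n", page)
}
```

In the handlers above, OnHTML would call store.update and OnScraped would call store.take and encode the result, so each page is emitted with its own title and content.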

Answer 2

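I based my work on Espresso's answer. I simply grab the whole html element in one function and then query the head and body inside it, so everything is nicely encapsulated in a single c.OnHTML.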

c2.OnHTML("html", func(html *colly.HTMLElement) {
    slug := strings.Split(html.Request.URL.String(), "/")[4]
    title := ""
    descr := ""
    h1    := ""

    html.ForEach("head", func(_ int, head *colly.HTMLElement) {
        title += head.ChildText("title")
        head.ForEach("meta", func(_ int, meta *colly.HTMLElement) {
            if meta.Attr("name") == "description" {
                descr += meta.Attr("content")
            }
        })
    })

    html.ForEach("h1", func(_ int, h1El *colly.HTMLElement) {
        h1 += h1El.Text
    })

    //Now you can do stuff with your elements from head and body
})
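
One caveat with the slug line above: strings.Split(...)[4] panics when the URL has fewer than five "/"-separated parts, for example a site's front page. A minimal sketch of a bounds-checked alternative, assuming the slug is the second path segment; pathSegment is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// pathSegment returns the idx-th (0-based) non-empty path segment of rawURL.
// ok is false when the URL cannot be parsed or has too few segments.
// Hypothetical helper, not part of colly.
func pathSegment(rawURL string, idx int) (segment string, ok bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	// Split the path on "/" and drop empty pieces from leading/trailing slashes.
	segs := strings.FieldsFunc(u.Path, func(r rune) bool { return r == '/' })
	if idx < 0 || idx >= len(segs) {
		return "", false
	}
	return segs[idx], true
}

func main() {
	slug, ok := pathSegment("https://www.coursera.org/learn/build-portfolio-website-html-css", 1)
	fmt.Println(slug, ok)

	_, ok = pathSegment("https://www.coursera.org/", 1)
	fmt.Println(ok)
}
```

Inside the handler this would replace the Split call with something like slug, ok := pathSegment(html.Request.URL.String(), 1), skipping the page when ok is false.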

go

elasticsearch

web-scraping

go-colly