Retrieving text from a website with goquery
I have HTML that looks roughly like this:
<h4>Movies</h4>
<h5><a href="external_link" target="_blank"> A Song For Jenny</a> (2015)</h5>
Rating: PG<br/>
Running Time (minutes): 77<br/>
Description: This Drama, based on real life events, tells the story of a family affected directly by the 7/7 London bombings. It shows love, loss, heartache and ...<br/>
<a href="/bmm/shop/Movie_Detail?movieid=2713288">More about A Song For Jenny</a><br/>
<a href="/bmm/shop/Edit_Movie?movieid=2713288">Edit A Song For Jenny</a><br/>
<br/>
<h5><a href="link" target="_blank">#RealityHigh</a> (2017)</h5>
Rating: PG<br/>
Running Time (minutes): 99<br/>
Description: High-achieving high-school senior Dani Barnes dreams of getting into UC Davis, the world's top veterinary school. Then a glamorous new friend draws ...<br/>
<a href="/bmm/shop/Movie_Detail?movieid=4089906">More about #RealityHigh</a><br/>
<a href="/bmm/shop/Edit_Movie?movieid=4089906">Edit #RealityHigh</a><br/>
<br/>
<h5><a href="link" target="_blank">1 Night</a> (2016)</h5>
Rating: PG<br/>
Running Time (minutes): 80<br/>
Description: Bea, a worrisome teenager, reconnects with her introverted childhood friend, Andy. The two overcome their differences in social status one night aft ...<br/>
<a href="/bmm/shop/Movie_Detail?movieid=3959071">More about 1 Night</a><br/>
<a href="/bmm/shop/Edit_Movie?movieid=3959071">Edit 1 Night</a><br/>
<br/>
<h5><a href="link" target="_blank">10 Cloverfield Lane</a> (2016)</h5>
Rating: PG<br/>
Running Time (minutes): 104<br/>
Description: Soon after leaving her fiancé Michelle is involved in a car accident. She awakens
to find herself sharing an underground bunker with Howard and Emme ...<br/>
<a href="/bmm/shop/Movie_Detail?movieid=3052189">More about 10 Cloverfield Lane</a><br/>
<a href="/bmm/shop/Edit_Movie?movieid=3052189">Edit 10 Cloverfield Lane</a><br/>
<br/>
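I need to get as much information as possible from this page using goquery. I know how to extract the external links (replaced by the word "link" in this snippet), and I know how to get the links to the more detailed pages, but I would also like to extract the information that appears only as text: the year (in the title), the running time, the short description, and the PG rating.
I can't figure out how to do this in goquery, because this text isn't wrapped in any div or other tag. I tried finding the h5 tags and calling .Next() on them, but I only get the <br> tags, not the text in between. How can I do this? If there is a better way than using goquery, I'm fine with that.
My code looks like this.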
// Retrieve the page count:
res, err = http.Get("myUrlAddress")
if err != nil {
	fmt.Println(err)
	os.Exit(-1)
}
doc, err = goquery.NewDocumentFromResponse(res)
if err != nil {
	fmt.Println(err)
	os.Exit(-1)
}
links := doc.Find(`a[href*="pageIndex"]`)
fmt.Println(links.Length()) // Output page count
s := doc.Find("h5").First().Next() // I expect it to be the text after the heading.
fmt.Println(s.Text())              // But it's empty and if I check the node type it says br
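Unfortunately, because of the way this HTML page is structured, goquery does not seem to be much help once you have located the part of the page that contains the movie list shown in your example, because the data points of interest are not isolated into elements that goquery can target.
However, the details can easily be parsed out with a regular expression, which can of course be modified as needed (especially if/when the original page changes its HTML structure).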
import (
	"regexp"
	"strconv"
)

type Movie struct {
	Title          string
	ReleaseYear    int
	Rating         string
	RuntimeMinutes int
	Description    string
}

var movieregexp = regexp.MustCompile(`` +
	`<h5><a.*?>\s*(.*?)\s*</a>\s*\((\d{4})\)</h5>` + // Title and release year
	`[\s\S]*?Rating: (.*?)<` +
	`[\s\S]*?Running Time \(minutes\): (\d{1,3})` +
	`[\s\S]*?Description: ([\s\S]*?)<`)

// ParseMovies returns a slice of movies parsed from the given string, possibly empty.
func ParseMovies(s string) []Movie {
	movies := []Movie{}
	// Ranging over a nil result is a no-op, so no explicit nil check is needed.
	for _, group := range movieregexp.FindAllStringSubmatch(s, -1) {
		// We know these integers parse correctly because of the regex.
		year, _ := strconv.Atoi(group[2])
		runtime, _ := strconv.Atoi(group[4])
		// Append the new movie to the list.
		movies = append(movies, Movie{
			Title:          group[1],
			ReleaseYear:    year,
			Rating:         group[3],
			RuntimeMinutes: runtime,
			Description:    group[5],
		})
	}
	return movies
}
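One thing to keep in mind with this approach: the regular expression captures raw page source, so the description may still contain HTML entities and line breaks (the "10 Cloverfield Lane" description in the sample spans two lines). A minimal sketch of a cleanup step, assuming only the standard library; cleanText is a hypothetical helper name, not something from the code above:

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// cleanText is a hypothetical helper. It decodes HTML entities such as
// &amp; and collapses runs of whitespace, including line breaks, into
// single spaces.
func cleanText(raw string) string {
	return strings.Join(strings.Fields(html.UnescapeString(raw)), " ")
}

func main() {
	raw := "Soon after leaving her fianc&eacute; Michelle is involved in a car accident. She awakens\n" +
		"to find herself sharing an underground bunker with Howard and Emme ..."
	fmt.Println(cleanText(raw))
}
```

It could be applied to the title and description groups before they are stored in the Movie value.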
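I don't like using regular expressions to parse HTML. I find them too fragile against small changes such as the order of tags.
I think it is better to fall back to html.Node (golang.org/x/net/html), which goquery is built on. The idea is to iterate over the siblings until they run out or the next h5 is encountered. Handling links or any other element tags can be a bit awkward, because html.Node offers a rather unfriendly API for attributes, but switching from it back to goquery is even more awkward.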
package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

type Movie struct {
}

func (m Movie) addTitle(s string) {
	fmt.Println("Title", s)
}

func (m Movie) addProperty(s string) {
	if s == "" {
		return
	}
	fmt.Println("Property", s)
}

var M []*Movie

func parseMovie(i int, s *goquery.Selection) {
	m := &Movie{}
	m.addTitle(s.Text())
	// Walk the raw sibling nodes after the h5, text nodes included.
loop:
	for node := s.Nodes[0].NextSibling; node != nil; node = node.NextSibling {
		switch node.Type {
		case html.TextNode:
			m.addProperty(strings.TrimSpace(node.Data))
		case html.ElementNode:
			switch node.DataAtom {
			case atom.A:
				// A link: do something with it. You may want to switch back to goquery here.
				fmt.Println(node.Attr)
			case atom.Br:
				continue
			case atom.H5:
				// The next movie starts here.
				break loop
			}
		}
	}
	M = append(M, m)
}

func main() {
	r, err := os.Open("movie.html")
	if err != nil {
		panic(err)
	}
	doc, err := goquery.NewDocumentFromReader(r)
	if err != nil {
		panic(err)
	}
	doc.Find("h5").Each(parseMovie)
}
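The code above only prints what it finds. As a rough sketch of how the printed strings could be turned into fields, assuming each text node has the "Key: value" shape seen in the sample and the heading text ends in "(year)": the Movie fields below are borrowed from the regular-expression answer, and setTitle and setProperty are hypothetical names, not part of goquery or of the code above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Movie mirrors the fields used in the regular-expression answer.
type Movie struct {
	Title          string
	ReleaseYear    int
	Rating         string
	RuntimeMinutes int
	Description    string
}

// setTitle splits heading text such as "A Song For Jenny (2015)" into a
// title and a release year. If no numeric "(year)" suffix is found, the
// whole text is kept as the title.
func (m *Movie) setTitle(s string) {
	s = strings.TrimSpace(s)
	open := strings.LastIndex(s, "(")
	if open >= 0 && strings.HasSuffix(s, ")") {
		if year, err := strconv.Atoi(s[open+1 : len(s)-1]); err == nil {
			m.Title = strings.TrimSpace(s[:open])
			m.ReleaseYear = year
			return
		}
	}
	m.Title = s
}

// setProperty interprets one "Key: value" string, splitting at the first
// colon. Unknown keys and text without a colon are ignored.
func (m *Movie) setProperty(s string) {
	i := strings.Index(s, ":")
	if i < 0 {
		return
	}
	key, value := strings.TrimSpace(s[:i]), strings.TrimSpace(s[i+1:])
	switch key {
	case "Rating":
		m.Rating = value
	case "Running Time (minutes)":
		if n, err := strconv.Atoi(value); err == nil {
			m.RuntimeMinutes = n
		}
	case "Description":
		m.Description = value
	}
}

func main() {
	m := &Movie{}
	m.setTitle(" A Song For Jenny (2015)")
	for _, text := range []string{
		"Rating: PG",
		"Running Time (minutes): 77",
		"Description: This Drama, based on real life events, ...",
	} {
		m.setProperty(text)
	}
	fmt.Printf("%+v\n", *m)
}
```

In parseMovie, the calls to addTitle and addProperty could be swapped for these two methods. Because the key is matched by name, this does not depend on the order in which the properties appear.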