Xidel: how to follow HTML pagination and extract URLs?

Using a Windows 7 batch file and xidel, I am testing on a paginated website like this example:

Link 1

Link 2

Link 3

1 2 3 4 5 6 7 8 9 10 Next

I found a way to get the first 10 links:

xidel.exe https://www.website.es/search?q=xidel+follow+pagination^&start=0 --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

But when I try to go to page 2 or page (n) using

-f "<A class="fl">{.}</A>"

or

--follow "//a/[@class='nav']"

nothing works. Can you give me some help or some examples?

Thanks.

Search terms in the URL query string

With this simple query...

xidel "https://www.google.com/search?q=xidel+follow+pagination" -e "$url"
https://consent.google.com/ml?continue=[...]

...you'll notice that we hit a cookie wall. With -f "//form" Xidel can "click" the consent button.

Extracting URLs:
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/@href"
/url?q=
/url?q=https://whosebug.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp
/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAgQAg&usg=AOvVaw3BfrZCAGHHs_nqpJ-1aj2u
[...]

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/resolve-uri(@href)"
https://www.google.com/url?q=
https://www.google.com/url?q=https://whosebug.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAgQAg&usg=AOvVaw19rnj9nPwMX-zKVSNzacrw
https://www.google.com/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAcQAg&usg=AOvVaw3T4VVe92ucN0Jc7hzvAn8Y
[...]

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))"
{
  "url": "https://www.google.com/url?q=
  "protocol": "https",
  "host": "www.google.com",
  "path": "url",
  "query": "q=
  "params": {
    "q": "
    "sa": "U",
    "ved": "2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg",
    "usg": "AOvVaw1qftOzBqM1OfXkWkkJm0B8"
  }
}
[...]

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q"

https://whosebug.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
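
For readers who want to see what the resolve-uri() / request-decode() / params/q chain boils down to, here is a rough Python sketch of the same steps. It is not how Xidel implements them; the function name target_of and the sample href are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

def target_of(href, base="https://www.google.com/search"):
    """Return the decoded q parameter of a /url?q=... redirect link, or None."""
    absolute = urljoin(base, href)                # roughly resolve-uri(@href)
    params = parse_qs(urlsplit(absolute).query)   # roughly request-decode(...)/params
    values = params.get("q")                      # blank values are dropped by parse_qs
    return values[0] if values else None

# Hypothetical href, shaped like the hrefs in the output above
sample = "/url?q=https://example.com/page%3Ffilter%3Dall&sa=U&ved=abc"
print(target_of(sample))
```

As in the Xidel output, the percent-encoded characters in the q value come back decoded, and a link with an empty q yields nothing.
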
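Following pagination:

The final command above extracts the URLs from the 1st results page. To include the URLs from the other results pages as well, you can do a "recursive follow":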

xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
      -f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q" ^
      -f "//a[@aria-label and contains(.,'>')]"

-f "//a[@aria-label and contains(.,'>')]" "clicks" the next-page button until there is none left.
Do note the warning from Xidel's author, though: !!! Recursive follow is deprecated and might be removed soon. !!!
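
The idea behind a "recursive follow" can be sketched as a plain loop: extract from the current page, look for a next-page link, and repeat until none is found. The Python below is only an illustration of that control flow. The crawl function, the PAGES dictionary and its keys are all hypothetical and stand in for real HTTP requests and XPath queries.

```python
def crawl(start, fetch, max_pages=50):
    """Collect links page by page, following 'next' until there is none."""
    links, url, seen = [], start, set()
    # Stop on a missing next link, a loop back to a visited page, or the page cap
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        links.extend(page["links"])   # the "-e" step: extract from this page
        url = page.get("next")        # the "-f" step: follow the next-page link
    return links

# Hypothetical three-page site instead of real network access
PAGES = {
    "p1": {"links": ["a", "b"], "next": "p2"},
    "p2": {"links": ["c"], "next": "p3"},
    "p3": {"links": ["d"]},
}
print(crawl("p1", PAGES.__getitem__))
```

The max_pages cap and the seen set are defensive additions of this sketch, so that a site whose "next" link loops back cannot keep the crawl running forever.
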

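Search terms through form()

A better alternative is to visit the home page and submit the search terms through form(). A user agent is required, but the cookie-consent button is "clicked" automatically and the HTML source is much easier to parse.

Extracting URLs: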
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href"

https://whosebug.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
Following pagination:

This can be done with another "recursive follow":

xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href" ^
      -f "//a[@id='pnnext']/@href"

But in this case it is much easier to change the form() parameters:

xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination','num':'100'})" -e "//div[@class='yuRUbf']/a/@href"

I don't know whether num has a hard limit, but 100 at least appears to work.
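
Conceptually, overriding form parameters means merging your own values over the form's existing fields and sending the result as a query string. A minimal Python sketch of that idea follows. It assumes a GET form, and the build_request helper, the action URL and the "hl" default field are hypothetical, not taken from Google's actual form.

```python
from urllib.parse import urlencode

def build_request(action, defaults, overrides):
    """Merge overrides into the form's default fields and build a GET URL."""
    fields = {**defaults, **overrides}   # overrides win on duplicate names
    return f"{action}?{urlencode(fields)}"

# "hl" is a made-up hidden field; q and num mirror the form() call above
url = build_request(
    "https://www.google.com/search",
    {"hl": "en"},
    {"q": "xidel follow pagination", "num": "100"},
)
print(url)
```
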

Reino is right. But querying Google can also be done like this:

xidel -s "https://www.google.com" ^
      -f "form(//form,{'q':'xidel follow pagination','num':'25'})" ^
      -e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"
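
To make the regex-based variant concrete, here is a small Python sketch that applies the same pattern to a list of href values and drops the ones that do not match, which is the job the trailing [.] filter does in the XPath expression. The function name extract_targets and the sample hrefs are invented for illustration.

```python
import re

# Same pattern as in the extract() call above
PATTERN = re.compile(r"url\?q=(.+?)&")

def extract_targets(hrefs):
    """Return capture group 1 for every href that matches the pattern."""
    targets = []
    for href in hrefs:
        match = PATTERN.search(href)
        if match:                       # non-matching hrefs are skipped
            targets.append(match.group(1))
    return targets

# Hypothetical hrefs: one redirect link, one ordinary link
hrefs = ["/url?q=https://example.com/a&sa=U&ved=abc", "/search?q=foo"]
print(extract_targets(hrefs))
```

Unlike the request-decode() approach, a plain regex leaves any percent-encoding in the captured URL untouched, so an extra decoding step may be needed.
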