Xidel: how to follow HTML pagination and extract URLs?
This is about Windows 7 batch and Xidel.
I am testing on a paginated website like this example:
Link 1
Link 2
Link 3
1 2 3 4 5 6 7 8 9 10 Next
I found a way to get the first 10 links:
xidel.exe https://www.website.es/search?q=xidel+follow+pagination^&start=0 --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
But when I try to go to page 2 or page (n) using
-f "<A class="fl">{.}</A>"
or
--follow "//a/[@class='nav']"
nothing works. Can you give me some help or some examples?
Thanks.
Search term in the URL query string
With this simple query...
xidel "https://www.google.com/search?q=xidel+follow+pagination" -e "$url"
https://consent.google.com/ml?continue=[...]
...you'll notice we hit a cookie wall. With -f "//form" Xidel can "click" the consent button.
Extracting the URLs:
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/@href"
/url?q=
/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAIQAg&usg=AOvVaw25MiKPwJB0jVHz2JTl5mBp
/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwjQ7eCblIL4AhXCjqQKHVOcCNoQFnoECAgQAg&usg=AOvVaw3BfrZCAGHHs_nqpJ-1aj2u
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/resolve-uri(@href)"
https://www.google.com/url?q=
https://www.google.com/url?q=https://stackoverflow.com/tags/xidel/hot%3Ffilter%3Dall&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAgQAg&usg=AOvVaw19rnj9nPwMX-zKVSNzacrw
https://www.google.com/url?q=https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html&sa=U&ved=2ahUKEwif0IL0mIL4AhUHtKQKHSh7DhoQFnoECAcQAg&usg=AOvVaw3T4VVe92ucN0Jc7hzvAn8Y
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))"
{
"url": "https://www.google.com/url?q=
"protocol": "https",
"host": "www.google.com",
"path": "url",
"query": "q=
"params": {
"q": "
"sa": "U",
"ved": "2ahUKEwid9bHXmYL4AhWEIMUKHabxAoAQFnoECAAQAg",
"usg": "AOvVaw1qftOzBqM1OfXkWkkJm0B8"
}
}
[...]
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q"
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
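For readers who want to see what resolve-uri() and request-decode() accomplish in the last command, here is a minimal Python sketch of the same idea using only the standard library. It is not Xidel code; the helper name target_from_redirect, the base URL and the shortened example href are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit, parse_qs


def target_from_redirect(href, base="https://www.google.com/search"):
    """Return the decoded 'q' parameter of a relative /url?q=... link, or None."""
    # Step 1: make the relative href absolute (the role resolve-uri() plays above).
    absolute = urljoin(base, href)
    # Step 2: split the query string into parameters (the role of request-decode()).
    params = parse_qs(urlsplit(absolute).query)
    # Step 3: pick the 'q' parameter; parse_qs has already percent-decoded it.
    values = params.get("q")
    return values[0] if values else None


# Hypothetical, shortened href in the same shape as the output shown above.
example_href = "/url?q=https://example.com/page%3Ffilter%3Dall&sa=U"
target = target_from_redirect(example_href)
print(target)
```

Links without a q parameter yield None, so they can be filtered out in the same way the empty results are dropped in the Xidel queries.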
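Following the pagination:
The final command above extracts the URLs from the 1st results page. To include the URLs from the other results pages, you can do a "recursive follow":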
xidel -s "https://www.google.com/search?q=xidel+follow+pagination" ^
-f "//form" -e "//div[@class='egMi0 kCrYT']/a/request-decode(resolve-uri(@href))/params/q" ^
-f "//a[@aria-label and contains(.,'>')]"
-f "//a[@aria-label and contains(.,'>')]" "clicks" the next-page button until there are no more pages.
Do take note of the warning from Xidel's author, though: !!! Recursive follow is deprecated and might be removed soon. !!!
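The "click next until there is no next" behaviour can be pictured as a plain loop. The Python sketch below only illustrates that idea and is not how Xidel implements it: the in-memory FAKE_SITE, the crawl() helper and the id="next" marker are all hypothetical stand-ins for real pages and a real next-page link.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects result links and remembers the next-page link, if any."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        if attributes.get("id") == "next":  # hypothetical next-page marker
            self.next_href = href
        else:
            self.results.append(href)


def crawl(start, fetch, max_pages=10):
    """Follow next-page links from `start` until none is left."""
    collected, seen, url = [], set(), start
    # Stop when there is no next link, when a page repeats, or at the page cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        collected.extend(parser.results)
        url = parser.next_href
    return collected


# Hypothetical three-page site kept in memory so the sketch runs offline.
FAKE_SITE = {
    "/p1": '<a href="https://a.example/">A</a><a id="next" href="/p2">Next</a>',
    "/p2": '<a href="https://b.example/">B</a><a id="next" href="/p3">Next</a>',
    "/p3": '<a href="https://c.example/">C</a>',
}
links = crawl("/p1", FAKE_SITE.__getitem__)
print(links)
```

The set of visited pages and the page cap are there so the loop cannot run forever if a site links its last page back to an earlier one.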
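Search term through form()
A better alternative is to visit the home page and submit the search term through form(). A user agent is required, but the cookie-consent button is "clicked" automatically and the HTML source is much easier to parse.
Extracting the URLs: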
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href"
https://stackoverflow.com/tags/xidel/hot?filter=all
https://www.adoclib.com/blog/how-to-extract-using-xidel-all-srcset-width-strings-from-an.html
[...]
Following the pagination:
This can be done with another "recursive follow":
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination'})" -e "//div[@class='yuRUbf']/a/@href" ^
-f "//a[@id='pnnext']/@href"
But in this case changing the form() parameters is much easier:
xidel -s --user-agent "Mozilla/5.0 Firefox/100.0" "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination','num':'100'})" -e "//div[@class='yuRUbf']/a/@href"
I don't know whether num has a hard limit, but 100 at least seems to work.
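To make the effect of the extra parameter concrete, here is a small Python sketch of how such key/value pairs are encoded into a query string. It assumes the form is submitted with GET to a /search endpoint and ignores any hidden fields the real form may carry, so the printed URL is only an approximation of the request Xidel sends.

```python
from urllib.parse import urlencode

# The same two fields that are passed to form() above.
fields = {"q": "xidel follow pagination", "num": "100"}

# urlencode() joins the pairs with '&' and encodes spaces as '+'.
query = urlencode(fields)

# Assumed endpoint; the real form action is whatever the home page declares.
search_url = "https://www.google.com/search?" + query
print(search_url)
```

The q part comes out as q=xidel+follow+pagination, the same form used in the hand-written URLs earlier in this answer.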
Reino is right. But querying Google can also be done like this:
xidel -s "https://www.google.com" ^
-f "form(//form,{'q':'xidel follow pagination','num':'25'})" ^
-e "//a/extract(@href,'url\?q=(.+?)&',1)[.]"