Python: search Google for websites that end with a specific word

Question

I'm trying to find every website on Google whose address ends with "gencat.cat".

My code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'gencat.cat'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.a['href'] # or ('.yuRUbf a')['href']
    print(link)

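My output:

The problem is that it only finds a few websites, and the results include URLs that do not contain "gencat.cat" or repeat several pages from the same site: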

https://web.gencat.cat/ca/inici
https://web.gencat.cat/es/inici/
https://web.gencat.cat/ca/tramits
https://web.gencat.cat/en/inici/index.html
https://govern.cat/
https://govern.cat/salapremsa/
http://www.gencat.es/
http://www.regencos.cat/promocio-variable/preguntes-mes-frequents-sobre-el-coronavirus/
https://tauler.seu.cat/inici.do?idens=1

Desired output:

https://web.gencat.cat
http://agricultura.gencat.cat
http://cultura.gencat.cat
https://dretssocials.gencat.cat
http://economia.gencat.cat

Answer 1

If you only want the root of each site (the scheme plus the host name, which the code below calls TLD), you can split link on every "/".

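for result in soup.select('.tF2Cxc'):
    link = result.a['href'] # or ('.yuRUbf a')['href']
    print(link)

    # 'https://web.gencat.cat/ca/inici' splits into
    # ['https:', '', 'web.gencat.cat', 'ca', 'inici']
    string_splt = link.split("/")
    # reuse the link's own scheme so http:// sites are not rewritten to https://
    TLD = f"{string_splt[0]}//{string_splt[2]}"

    print(TLD)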

I'm sure there is a better way to put the pieces back together, but this seems to work. You will also need to handle duplicates.
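
One possible way to handle both the reassembly and the duplicates is sketched below. It uses urllib.parse.urlsplit from the standard library instead of splitting on "/" by hand, keeps only hosts that are "gencat.cat" or a subdomain of it, and drops repeats while preserving order. The helper name site_roots and its suffix parameter are hypothetical names chosen for illustration, not part of the original answer, and the sample links are copied from the question's output so the snippet runs without contacting Google.

```python
from urllib.parse import urlsplit


def site_roots(links, suffix="gencat.cat"):
    """Return unique scheme://host roots whose host is `suffix` or a subdomain of it.

    `site_roots` and `suffix` are illustrative names (an assumption of this sketch).
    """
    seen = set()
    roots = []
    for link in links:
        parts = urlsplit(link)
        host = (parts.hostname or "").lower()
        # Accept the domain itself or any subdomain, but not look-alikes
        # such as "notgencat.cat" that merely end with the same letters.
        if host != suffix and not host.endswith("." + suffix):
            continue
        root = f"{parts.scheme}://{host}"
        if root not in seen:  # skip roots that were already collected
            seen.add(root)
            roots.append(root)
    return roots


# Sample links copied from the question's output
links = [
    "https://web.gencat.cat/ca/inici",
    "https://web.gencat.cat/es/inici/",
    "https://govern.cat/",
    "http://www.gencat.es/",
    "http://www.regencos.cat/promocio-variable/preguntes-mes-frequents-sobre-el-coronavirus/",
]
print(site_roots(links))
```

In the scraping loop, the same helper could be fed the collected href values instead of the sample list. Note that this sketch treats http:// and https:// versions of the same host as different roots. Narrowing the query itself with Google's site: operator, for example params = {'q': 'site:gencat.cat'}, may also cut down on unrelated results, though a single results page will still return only a handful of links.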
