Using Beautiful Soup to find a phone number for a company name and address
I have a script that scrapes the names, locations, and provinces of Spanish companies from a website. The HTML also contains another link that takes you to a page with the phone number, but when I try to scrape it, it prints "None". Is there a way to make the script follow that link automatically and keep the scraped number matched to its company row?
import requests
from googlesearch import search
from bs4 import BeautifulSoup

for page in range(1, 65):
    url = "https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{page}.html".format(page=page)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")
    # scrape the list
    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        link = lis.find('href', class_="col1")
        info = [title, location, province, link]
        print(info)
Alternatively, is there a way to do this with the googlesearch library?
Many thanks.
First, the URL of the first page is "https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html",
not
"https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/1.html",
which is why your script returns no output.
You can try it like this:
import requests
# from googlesearch import search
from bs4 import BeautifulSoup

baseurl = ["https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"]
urls = [f'https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{i}.html' for i in range(2, 5)]
allurls = baseurl + urls
print(allurls)

for url in allurls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")
    # scrape the list
    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        link = lis.select("li.col1 a")[0]['href']
        info = [title, location, province, link]
        print(info)
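Once you have each row's link, the script can follow it and scrape the number itself. I don't know the markup of the detail pages, so the sketch below simply runs a Spanish-phone regex over the fetched HTML; `extract_phone`, `phone_for`, and the pattern are assumptions to adapt once you inspect a real detail page:

```python
import re

import requests

# Spanish landline/freephone numbers start with 8 or 9 and have nine digits,
# often grouped 3-3-3 (an assumed pattern; adjust after checking a real page).
PHONE_RE = re.compile(r'\b[89]\d{2}[\s.-]?\d{3}[\s.-]?\d{3}\b')

def extract_phone(html_text):
    """Return the first phone-like string found in the HTML, or None."""
    match = PHONE_RE.search(html_text)
    return match.group(0) if match else None

def phone_for(link):
    """Fetch a company detail page and scrape its phone number (network call)."""
    resp = requests.get(link)
    resp.raise_for_status()
    return extract_phone(resp.text)
```

Inside the loop you would then build `info = [title, location, province, link, phone_for(link)]`, so each number stays matched to its company row. Note this makes one extra request per company, so consider adding a small delay between calls.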