Iterate through a list of URLs for web scraping with Python using BeautifulSoup (unknown url type)
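I'm trying to scrape the content of each URL in a list I have. Building the list is not the problem; the list itself works fine.
The original link looks like this: https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/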
tags = soup('a', {'class': 'js-listing-link'})
for tag in tags:
    linktag = tag.get('href').strip()
    if linktag not in linklist:
        linklist.append(linktag)
The result of the above is a list of URLs as strings. But then I try this:
for link in linklist[0]:
    page2 = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    myhtml2 = urllib.request.urlopen(page2).read()
    soupfl = BeautifulSoup(myhtml2, 'html.parser')
just to check that everything works, but I get an error:
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'h'
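linklist[0] is the first URL, which is a single string, so "for link in linklist[0]" iterates over its characters and the first request is built from 'h'. Iterate over the list itself instead. To get all the links, you can use this example: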
import urllib.request
from bs4 import BeautifulSoup

URL = "https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/"
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

# Fetch the listing page and parse it.
r = urllib.request.Request(URL, headers=HEADERS)
soup = BeautifulSoup(urllib.request.urlopen(r).read(), "html.parser")

# Collect each listing link once, keeping the order found on the page.
tags = soup.find_all("a", {"class": "js-listing-link"})
links = []
for tag in tags:
    if tag["href"] not in links:
        links.append(tag["href"])

# Loop over the list itself, so each item is a full URL string.
for link in links:
    print("Getting:", link)
    r2 = urllib.request.Request(link, headers=HEADERS)
    soup2 = BeautifulSoup(urllib.request.urlopen(r2).read(), "html.parser")
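To see where the 'h' in the error comes from, here is a minimal sketch that reproduces it without touching the network. It assumes a one-element linklist as a stand-in for the scraped list (the real contents come from the page), and it only builds Request objects without opening them, which is already enough to trigger the ValueError.

```python
import urllib.request

# Hypothetical stand-in for the scraped list; the real one is built from the page.
linklist = ["https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/"]
headers = {"User-Agent": "Mozilla/5.0"}

# Buggy loop: linklist[0] is one string, so each "link" is a single character.
error_message = None
try:
    for link in linklist[0]:
        urllib.request.Request(link, headers=headers)
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Fixed loop: iterate over the list, so each "link" is a whole URL.
requests_built = [urllib.request.Request(link, headers=headers) for link in linklist]
print(requests_built[0].full_url)
```

If only the first URL is wanted as a quick check, linklist[:1] (a list containing the first item) can be looped over in place of linklist[0].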