Scraping webpages with Python
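I'm trying to extract links and their text from a webpage line by line, then insert the text and link into a dictionary, without using Beautiful Soup or regular expressions.
I keep getting this error:
Error: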
Traceback (most recent call last):
  File "F:/Homework7-2.py", line 13, in <module>
    link2 = link1.split("href=")[1]
IndexError: list index out of range
Code:
    import urllib.request
    url = "http://www.facebook.com"
    page = urllib.request.urlopen(url)
    mylinks = {}
    links = page.readline().decode('utf-8')
    for items in links:
        links = page.readline().decode('utf-8')
        if "a href=" in links:
            links = page.readline().decode('utf-8')
            link1 = links.split(">")[0]
            link2 = link1.split("href=")[1]
            mylinks = link2
    print(mylinks)
Answer:
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("")  # the URL is empty in the original post; put the page URL here
    # find all <a> tags that have an href attribute
    for a in BeautifulSoup(r.content, "html.parser").find_all("a", href=True):
        # print each href
        print(a["href"])
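This is obviously a very broad example, but it should get you started. If you need specific URLs, you can narrow the search to certain elements, though every page will be different. You won't find tools much easier to use for fetching and parsing than requests and BeautifulSoup.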
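Since the question rules out Beautiful Soup and regular expressions, here is a minimal standard-library sketch using html.parser.HTMLParser. The IndexError in the posted code most likely comes from one of two things: the loop tests one line for "a href=" and then reads the next line before splitting, so the line being split may contain no "href=" at all; or split(">")[0] cuts the line off before the href is reached. Parsing the whole document avoids that line-by-line bookkeeping. The names LinkCollector and extract_links and the sample HTML are assumptions for illustration, not part of the original post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} for every <a> tag that has an href.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a href=...> and </a>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links[text] = self._href
            self._href = None


def extract_links(html):
    """Return a dict mapping link text to href for the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup (an assumption, standing in for a downloaded page)
sample = '<p><a href="/home">Home</a> and <a href="/about">About <b>us</b></a></p>'
print(extract_links(sample))  # {'Home': '/home', 'About us': '/about'}
```

For a real page, the HTML string could come from urllib.request.urlopen(url).read().decode("utf-8", errors="replace"). Note that dictionary keys are unique, so two links with the same text will overwrite each other; keying on the href instead may suit some pages better.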