Python scraping webpages

Question

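I am trying to extract links and their text from a webpage line by line, then insert the text and the link into a dictionary, without using Beautiful Soup or regular expressions.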

I keep getting this error:

Error:

 Traceback (most recent call last):
 File "F:/Homework7-2.py", line 13, in <module>
 link2 = link1.split("href=")[1]
 IndexError: list index out of range

Code:

import urllib.request
url = "http://www.facebook.com" 
page = urllib.request.urlopen(url)
mylinks = {}
links = page.readline().decode('utf-8')


for items in links:
  links = page.readline().decode('utf-8')
  if "a href=" in links:
     links = page.readline().decode('utf-8')
     link1 = links.split(">")[0]
     link2 = link1.split("href=")[1]
     mylinks = link2
     print(mylinks)

Answer 1

import requests
from bs4 import BeautifulSoup

# the URL is left blank in this answer; put the target page's URL here
r = requests.get("")
# find all <a> tags that have an href attribute
for a in BeautifulSoup(r.content, "html.parser").find_all("a", href=True):
    # print each href
    print(a["href"])

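Obviously this is a very broad example, but it should get you started. If you want specific URLs, you can narrow the search down to certain elements, but every webpage will be different. You won't find many tools that are easier to use for parsing than requests and BeautifulSoup.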
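
If Beautiful Soup and regular expressions really are off the table, as the question asks, the standard library's html.parser module is one option. The IndexError in the question most likely comes from reading a new line after the "a href=" check and then keeping only the text before the first ">", so the string being split may not contain "href=" at all. The following is only a minimal sketch under that reading: the names LinkCollector, extract_links and sample are made up for illustration, and it keys the dictionary by link text, so links with identical text overwrite each other.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} pairs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # only keep text that appears inside an open <a href=...> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links[text] = self._href
            self._href = None


def extract_links(html):
    """Return a dict mapping link text to href for the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# made-up sample markup, used only to demonstrate the helper
sample = '<p><a href="/home">Home</a> and <a href="https://example.com/about">About <b>us</b></a></p>'
print(extract_links(sample))
```

To run it against a live page, the whole document can be read at once instead of line by line, for example extract_links(urllib.request.urlopen(url).read().decode("utf-8", errors="replace")), which avoids the problem of a tag and its text being split across lines.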
