regex and urllib.request to scrape links from HTML

Question

I am trying to parse HTML and extract every value that matches this regex: href="http://.+?"

Here is the code:

import urllib.request
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall('href="(http://.*?)"',html)
for link in links:
    print(link)

But I get this error: TypeError: cannot use a string pattern on a bytes-like object

Answer 1

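Your html is a bytes object. You can convert it with str(html) (note that this yields the b'...' repr of the bytes rather than decoded text, so html.decode() is usually the cleaner choice):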

re.findall(r'href="(http://.*?)"',str(html))

Alternatively, use a bytes pattern:

re.findall(rb'href="(http://.*?)"',html)
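
As a rough illustration of the two options, here is a minimal sketch that runs without a network call. The sample `html` bytes are made up for the example, and the string case uses `decode('utf-8')` on the assumption that the page is UTF-8. Note that a bytes pattern returns bytes matches, while a string pattern returns str matches.

```python
import re

# Hypothetical sample standing in for urllib.request.urlopen(url).read()
html = b'<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# Bytes pattern on bytes data: each match is a bytes object
bytes_links = re.findall(rb'href="(http://.*?)"', html)

# String pattern on decoded text: each match is a str
str_links = re.findall(r'href="(http://.*?)"', html.decode('utf-8'))

print(bytes_links)
print(str_links)
```

If you go with the bytes pattern, you will probably want to decode each match before printing or joining it with other strings.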

Answer 2

urlopen(url).read() returns a bytes object, so your html variable holds bytes as well. You can decode it like this:

htmlobject = urllib.request.urlopen(url)
html = htmlobject.read().decode('utf-8')

Then you can use html in your regex, since it is now a string.
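
Not every page is UTF-8, so one possible refinement is to read the charset the server declares and fall back to UTF-8 only when none is given. The sketch below assumes that approach; `extract_links` and `fetch_links` are hypothetical helper names, and the sample bytes are made up so the check at the bottom runs offline.

```python
import re
import urllib.request

LINK_RE = re.compile(r'href="(http://.*?)"')

def extract_links(raw, charset=None):
    """Decode raw bytes (falling back to UTF-8) and return the matched links."""
    text = raw.decode(charset or 'utf-8', errors='replace')
    return LINK_RE.findall(text)

def fetch_links(url):
    """Fetch url and extract links using the charset from the response headers."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the header has no charset
        charset = response.headers.get_content_charset()
        return extract_links(response.read(), charset)

# Offline check with made-up Latin-1 bytes (no network needed)
sample = 'caf\u00e9 <a href="http://example.com/">x</a>'.encode('latin-1')
print(extract_links(sample, 'latin-1'))
```

Passing `errors='replace'` keeps a stray undecodable byte from raising an exception, at the cost of a replacement character in the text.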

Answer 3

If you want all the links on a page, you don't even need a regex, because you can just use bs4 to get what you need :-)

import requests
import bs4
soup = bs4.BeautifulSoup(requests.get('https://dr.dk/').text, 'lxml')
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
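
If installing requests, bs4, and lxml is not an option, roughly the same idea can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not what the answer above uses; `LinkCollector` is a hypothetical class name and the sample markup is made up so the example runs offline.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has a non-empty one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Made-up markup: one absolute link, one anchor without href, one relative link
sample = ('<p><a href="https://example.com/">one</a> '
          '<a name="x">no href</a> <a href="/rel">two</a></p>')

collector = LinkCollector()
collector.feed(sample)
for link in collector.links:
    print(link)
```

Unlike the regex in the question, a parser-based approach also picks up https and relative links, so filter the results if you only want http:// URLs.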

Hope this helps. Good luck with your project ;-)


python

regex

urllib

html-parsing
