Python: searching through an HTML file and grabbing <a> tags with their href and text content

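I need help with a solution that uses Python 3 to search an HTML file and retrieve all the <a> links on the page, then adds each scraped value to a dictionary together with its adjacent href (URL).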


import urllib3
import re

http = urllib3.PoolManager()
my_url = ""
a = http.request("GET", my_url)
html = a.data  # raw response body, which is bytes

links = re.finditer(r' href="?([^\s^"]+)', html)

for link in links:
    print(link)

TypeError: can't use a string pattern on a bytes-like object


I also tried lxml...

import lxml.html
links = lxml.html.parse("").xpath("//a/@href")
for link in links:
    print(link)



def news_feed(self, stock):
    http = urllib3.PoolManager()
    my_url = "" + stock
    a = http.request("GET", my_url)
    html = a.data.decode('utf-8')
    xml = fromstring(html, HTMLParser())
    # Only the links inside the news table are kept.
    a_tags = xml.xpath("//table[@id='yfncsumtab']//a")
    self.paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)

Use an HTML parser and decode the bytes as suggested. BeautifulSoup makes the job very easy and is far more reliable than a regex for parsing HTML:

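import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
my_url = ""
a = http.request("GET", my_url)
html = a.data.decode("utf-8")  # decode the bytes into a str

# href=True keeps only the <a> tags that actually have an href attribute
print([a["href"] for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)])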
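The decoding step is what resolves the TypeError from the question: urllib3 returns the body as bytes, while the pattern is a str. A minimal sketch of that point, assuming a hard-coded bytes literal in place of the real response body (`raw` and the example.com URL are made up so it runs offline):

```python
import re

# Stand-in for `a.data`; the real value would come from urllib3 (assumption).
raw = b'<a href="http://example.com/one">One</a> <a href="/two">Two</a>'

pattern = r' href="?([^\s"]+)'

# A str pattern applied to bytes raises TypeError.
error = None
try:
    re.findall(pattern, raw)
except TypeError as exc:
    error = str(exc)

# Decoding first makes the same pattern usable.
html = raw.decode("utf-8")
links = re.findall(pattern, html)
print(error)
print(links)
```

The regex is only there to show the bytes/str mismatch; for real pages a parser copes far better with attribute order, quoting and nesting.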

If you only want the links that start with http, you can use a CSS selector with select:

soup = BeautifulSoup(html, "html.parser")

print([a["href"] for a in soup.select("a[href^=http]")])


['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

To get both the text and the href:

from pprint import pprint as pp

soup = BeautifulSoup(html, "html.parser")

a_tags = soup.select("a[href^=http]")
# map each link's text to its href
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)



 {u'Alphabet overtakes Apple in market value - for now': '',
 u'Alphabet passes Apple to become most valuable traded U.S. company': '',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': '',
 u'Apple iPhone sales weaker than expected': '',
 u'Apple likely to invoke free-speech rights in encryption fight': '',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': '',
 u'Apple shares fall most in two years in wake of earnings report': '',
 u"Apple's new iPhone faces challenge measuring up in China, India": '',
 u"Bad run continues for 'Freedom 251', website down again on second day": '',
 u'Capital IQ': '',
 u'Commodity Systems, Inc. (CSI)': '',
 u'Download the new Yahoo Mail app': '',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": '',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': '',
 u'Help': '',
 u'Mail': '',
 u'Markets': '',
 u'Morningstar, Inc.': '',
 u'My Yahoo': '',
 u'New User? Register': '',
 u'Report an Issue': '',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': '',
 u'Samsung wins appeal in patent dispute with Apple': '',
 u'Sign In': '',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": '',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': '',
 u'U.S. appeals court upholds Apple e-book settlement': '',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': '',
 u'With China weakening, Apple turns to India': '',
 u'Yahoo': '',
 u'Yahoo India Finance': '',
 u'other exchanges': '',
 u'premium service.': ''}

a[href^=http] means: give me all the a tags that have an href attribute whose value starts with http.
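As a rough illustration of how the prefix selector differs from plain `href=True`, here is a small sketch on an inline snippet. The markup and the example.com URLs are invented for the example, and `html.parser` is chosen only so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two absolute links, one relative link, one anchor without href.
html = """
<a href="http://example.com/story">Story</a>
<a href="https://example.com/secure">Secure</a>
<a href="/q/h?s=AAPL">Older Headlines</a>
<a name="top">No href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> that has an href at all, relative links included.
with_href = [a["href"] for a in soup.find_all("a", href=True)]

# Only hrefs whose value starts with "http" (this also covers "https").
absolute = [a["href"] for a in soup.select('a[href^="http"]')]

print(with_href)
print(absolute)
```

The relative link is dropped by the selector, which is why 'Older Headlines' would not show up in a result filtered this way.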


Using lxml, with the table id to get just the story links, which are probably what you are most interested in:

from pprint import pprint as pp
from lxml.etree import fromstring, HTMLParser

# _html is the decoded page source
xml = fromstring(_html, HTMLParser())

a_tags = xml.xpath("//table[@id='yfncsumtab']//a")

# first text node of each link -> its href
paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
pp(paired)


{'Alphabet overtakes Apple in market value - for now': '',
 'Alphabet passes Apple to become most valuable traded U.S. company': '',
 'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': '',
 'Apple iPhone sales weaker than expected': '',
 'Apple likely to invoke free-speech rights in encryption fight': '',
 'Apple sees first sales dip in more than a decade as super-growth era falters': '',
 'Apple shares fall most in two years in wake of earnings report': '',
 "Apple's new iPhone faces challenge measuring up in China, India": '',
 "Bad run continues for 'Freedom 251', website down again on second day": '',
 "EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": '',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': '',
 'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 'Samsung Elec warns of difficult 2016 as smartphone troubles spread': '',
 'Samsung wins appeal in patent dispute with Apple': '',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": '',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': '',
 'U.S. appeals court upholds Apple e-book settlement': '',
 'U.S., Apple ratchet up rhetoric in fight over encryption': '',
 'With China weakening, Apple turns to India': ''}

We can do the same with select:

from pprint import pprint as pp

soup = BeautifulSoup(_html, "html.parser")

# every <a> inside the element with id "yfncsumtab"
a_tags = soup.select("#yfncsumtab a")
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)

This matches our lxml output:

{u'Alphabet overtakes Apple in market value - for now': '',
 u'Alphabet passes Apple to become most valuable traded U.S. company': '',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': '',
 u'Apple iPhone sales weaker than expected': '',
 u'Apple likely to invoke free-speech rights in encryption fight': '',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': '',
 u'Apple shares fall most in two years in wake of earnings report': '',
 u"Apple's new iPhone faces challenge measuring up in China, India": '',
 u"Bad run continues for 'Freedom 251', website down again on second day": '',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": '',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': '',
 u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': '',
 u'Samsung wins appeal in patent dispute with Apple': '',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": '',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': '',
 u'U.S. appeals court upholds Apple e-book settlement': '',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': '',
 u'With China weakening, Apple turns to India': ''}

You could also just use //*[@id='yfncsumtab']//a, since an id should be unique.
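A small sketch of that equivalence, assuming a made-up snippet in which the id `news` stands in for `yfncsumtab` and the hrefs are placeholders:

```python
from lxml.etree import fromstring, HTMLParser

# Hypothetical markup: one link outside the table, two inside it.
html = """
<div><a href="/outside">Outside</a></div>
<table id="news"><tr><td>
  <a href="/story-1"> Story 1 </a>
  <a href="/story-2">Story 2</a>
</td></tr></table>
"""

root = fromstring(html, HTMLParser())

# Restricting by tag name and id...
by_table = root.xpath("//table[@id='news']//a")
# ...or by id alone selects the same links when the id is unique.
by_any = root.xpath("//*[@id='news']//a")

paired = {a.xpath(".//text()")[0].strip(): a.xpath("./@href")[0] for a in by_any}
print(paired)
```

Both expressions skip the link outside the table; the id-only form just does not care what kind of element carries the id.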

To get the first six links from the table with xpath, we can go through the ul elements and use ul[position() < 7]:

from pprint import pprint as pp

# pull the first 6
a_tags = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a")

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
pp(paired)


{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': '',
 "Apple's new iPhone faces challenge measuring up in China, India": '',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': '',
 'Samsung wins appeal in patent dispute with Apple': '',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": '',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': ''}

For small tables you could also simply slice.
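The two approaches can be compared on a toy table. This is only a sketch: the markup, the id `news` and the helper `pair` are invented for illustration, and the limit is 2 rather than 6 to keep the snippet short. Note that `position()` counts each ul among the ul children of its own parent, so it lines up with slicing here only because every ul is a sibling holding a single link.

```python
from lxml.etree import fromstring, HTMLParser

# Hypothetical markup: three sibling <ul> elements, one link each.
html = """
<table id="news"><tr><td>
  <ul><li><a href="/a">A</a></li></ul>
  <ul><li><a href="/b">B</a></li></ul>
  <ul><li><a href="/c">C</a></li></ul>
</td></tr></table>
"""

root = fromstring(html, HTMLParser())


def pair(a_tags):
    """Map each link's first text node to its href (hypothetical helper)."""
    return {a.xpath("./text()")[0].strip(): a.xpath("./@href")[0] for a in a_tags}


# Limit inside the XPath: links from the first two <ul> siblings.
via_xpath = pair(root.xpath("//*[@id='news']//ul[position() < 3]//a"))

# Limit in Python: select every link, then slice the resulting list.
via_slice = pair(root.xpath("//*[@id='news']//a")[:2])

print(via_xpath)
print(via_slice)
```

Slicing counts links while the XPath predicate counts ul elements, so the results would diverge if a single ul contained several links.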