在 Python 中经历 HTML DOM

Question

我想写一个 Python 脚本（使用 3.4.3）从 URL 抓取一个 HTML 页面并且可以通过 DOM尝试查找特定元素。

我目前有这个：

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

当我打印内容时，它确实会打印出整个 html 页面，这与我想要的很接近......虽然我更希望能够浏览 DOM 而不是然后将其视为一个巨大的字符串。

我对 Python 还是很陌生，但有使用多种其他语言的经验（主要是 Java、C#、C++、C、PHP、JS）。我以前用 Java 做过类似的事情，但想在 Python.

中尝试一下

感谢任何帮助。干杯！

Answer 1

查看 BeautifulSoup 模块。

from bs4 import BeautifulSoup
import urllib                                       
soup = BeautifulSoup(urllib.urlopen("http://google.com").read())

for link in soup.find_all('a'):
    print(link.get('href'))

Answer 2

您可以使用许多不同的模块。例如，lxml or BeautifulSoup.

这是一个 lxml 示例：

import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag

>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

还有一个BeautifulSoup例子：

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite)

description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute

>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."

注意 BeautifulSoup returns 是一个 unicode 字符串，而 lxml 不是。根据需要，这可以是 useful/hurtful。

在 Python 中经历 HTML DOM

Going through HTML DOM in Python

html

python

dom

httprequest