Webscraping HTML with Python

Question

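Sorry if this is a duplicate, but I've been looking through a lot of Stack Overflow questions on this and can't find a similar situation. I may be barking up the wrong tree here, but I'm new to programming, so even if someone could just set me on the right path it would help immensely.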

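I'm trying to scrape data from a website that can only be accessed from inside our network, using Python 3.7 and Beautiful Soup 4. My first question is: is this a best-practice approach for a novice programmer, or should I be looking into JavaScript instead of Python?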

My second question: the website's root HTML file has the following attribute on its html tag: xmlns="http://www.w3.org/1999/xhtml". Does BeautifulSoup4 work with XHTML?

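I'll admit I know nothing about web development, so even if someone can give me a few keywords or tips to start researching and get me on a more productive path, I'd appreciate it. Right now my biggest problem is that I don't know what I don't know. All the Python web scraping examples work on much simpler .html pages, whereas this page tree consists of multiple html/css/jpg and gif files.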

Thanks, -Dane

Answer 1

Python, requests and BeautifulSoup are definitely the way to go, especially for a beginner. BeautifulSoup works with all variations of html, xml and so on.
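To make the XHTML point concrete, here is a minimal sketch that parses a small XHTML string carrying the same xmlns attribute mentioned in the question. The sample markup is made up for illustration, and the sketch assumes bs4 is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical XHTML document, made up for illustration only
xhtml = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Internal status page</title></head>
  <body>
    <p>Hello <a href="reports/index.html">reports</a></p>
    <img src="images/logo.gif" alt="logo" />
  </body>
</html>"""

soup = BeautifulSoup(xhtml, "html.parser")

# The usual navigation works the same way as on a plain HTML page
page_title = soup.title.string
namespace = soup.html.get("xmlns")
first_href = soup.find("a")["href"]
image_src = soup.find("img")["src"]

print(page_title)
print(namespace)
print(first_href, image_src)
```

The xmlns attribute is simply treated as another attribute on the html tag here, so it should not get in the way of finding tags by name.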

You will need to install Python and then install requests and bs4. Both are easy to do by reading the requests docs and the bs4 docs.

If you don't know Python 3 yet, I'd suggest learning some of the basics first.

Here's a simple example that gets the title of the page you request:

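import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup as bs

url = 'http://some.local.domain/'

response = requests.get(url)
soup = bs(response.text, 'html.parser')

# let's get the title of the page
title = soup.title
print(title)

# let's get all the links in the page; href=True skips anchors without an href
links = soup.find_all('a', href=True)
for link in links:
    print(link.get('href'))

# resolve the first two hrefs against the page URL, since they may be relative
# (this assumes the page contains at least two links)
link1 = urljoin(url, links[0]['href'])
link2 = urljoin(url, links[1]['href'])

# let's follow a link we find in the page (we'll go for the first)
response = requests.get(link1, stream=True)
# if the link is an image (or another file) and we want to download it
if response.status_code == 200:
    # name the local file after the last path segment of the link
    filename = os.path.basename(urlparse(link1).path) or 'download'
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

# if the link is another web page
response = requests.get(link2)
soup = bs(response.text, 'html.parser')

# let's get the title of that page
title = soup.title
print(title)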
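Since the page tree in the question mixes html, jpg and gif files, here is a rough sketch of telling pages and images apart before downloading anything. The helper resolves each link against the page URL and sorts it by file extension. The function name classify_links, the extension list, and the sample markup are all assumptions made for illustration; a real site may serve images from URLs without an extension, in which case checking the response's Content-Type header would likely be more reliable.

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Assumed set of image extensions; extend it to match the site
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png"}


def classify_links(html, base_url):
    """Return (page_urls, image_urls) found in html, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    pages, images = [], []

    # <a href=...> may point at pages or directly at image files
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])
        extension = os.path.splitext(urlparse(absolute).path)[1].lower()
        if extension in IMAGE_EXTENSIONS:
            images.append(absolute)
        else:
            pages.append(absolute)

    # <img src=...> always refers to an image
    for img in soup.find_all("img", src=True):
        images.append(urljoin(base_url, img["src"]))

    return pages, images


# Made-up markup and base URL, for illustration only
sample_html = """
<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <a href="reports/index.html">Reports</a>
  <a href="/img/chart.JPG">Chart</a>
  <img src="logo.gif" alt="logo" />
</body></html>
"""
pages, images = classify_links(sample_html, "http://some.local.domain/site/")
print(pages)
print(images)
```

The page URLs can then be fetched and parsed with requests and BeautifulSoup, while the image URLs can be streamed to disk.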

Go ahead and look for tutorials on requests and BeautifulSoup; there are tons of them... like this one

html

python

xhtml

automation

web-scraping