How to scrape all the home page text content of a website?

Question

I'm new to web scraping, and I want to scrape all the text content of a home page.

Here is my code, but it isn't working correctly.

from bs4 import BeautifulSoup
import requests


website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")

full_text = soup.find_all()

print(full_text)

When I print "full_text", it gives me a lot of HTML content, but not all of it. When I Ctrl+F for "traiteurcheminfaisant@hotmail.com", the email address shown in the footer of the home page, it is not found in full_text.

Thanks for your help!

Answer 1

I haven't used BeautifulSoup before, but try using urlopen instead. It stores the web page as a string, which you can then search for the email address.

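from urllib.error import URLError
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    # Decode the raw bytes into a string, skipping any undecodable bytes
    html = response.read().decode(encoding="utf-8", errors="ignore")
    # str.find returns the index of the first match, or -1 if it is absent
    print(html.find("traiteurcheminfaisant@hotmail.com"))
except URLError:
    # Catch only network/HTTP failures instead of hiding every exception
    print("Cannot open webpage")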

Answer 2

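After a quick look at the website you are trying to scrape, I suspect that not all of the content is loaded when you send a simple GET request with the requests module. In other words, it seems likely that some components of the site, such as the footer you mentioned, are loaded asynchronously with JavaScript.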
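One way to test that suspicion before reaching for heavier tools is to look at the HTML exactly as the server sends it, with no JavaScript executed. The following is only a minimal sketch using the standard library; the helper names `fetch_static_html` and `appears_in` are made up for illustration and are not part of any library.

```python
from urllib.request import urlopen


def fetch_static_html(url: str, timeout: float = 10.0) -> str:
    """Download the HTML as sent by the server (no JavaScript is run)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="ignore")


def appears_in(html: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in html.lower()


# Example usage (performs a network request, so it is left commented out):
# html = fetch_static_html("http://www.traiteurcheminfaisant.com/")
# print(appears_in(html, "traiteurcheminfaisant@hotmail.com"))
```

If the check comes back False while the address is clearly visible in a browser, that is consistent with the content being injected by JavaScript, although it could also mean the address is obfuscated in the markup.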

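If that is the case, you will probably want to use some sort of automation tool to navigate to the page, wait for it to load, and then parse the fully loaded source. The most common tool for this is Selenium. It can be a bit tricky to set up the first time, because you also need to install a separate webdriver for whichever browser you want to use. That said, the last time I set it up it was pretty easy. Here is a rough example of what this might look like for you (once you have Selenium set up properly):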

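from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

import time

# Selenium 4 takes the driver path through a Service object
# (the older executable_path argument was removed in recent releases)
driver = webdriver.Firefox(service=Service('/your/path/to/geckodriver'))
try:
    driver.get('http://www.traiteurcheminfaisant.com')
    # Crude fixed wait so JavaScript-loaded content has time to appear
    time.sleep(2)
    source = driver.page_source
finally:
    # Always close the browser, even if loading the page fails
    driver.quit()

soup = BeautifulSoup(source, 'html.parser')

# find_all() with no arguments returns every tag in the parsed page
full_text = soup.find_all()

print(full_text)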
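Note that find_all() returns tags, so the printout still contains markup. If the goal is only the readable text, something like the following might be closer to what the question asks for. It is a sketch rather than a definitive recipe: the helper name `visible_text` and the sample HTML are invented for illustration, and in practice you would pass in `driver.page_source` (or `ra.text`) instead.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Return the text of an HTML document, one text chunk per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never rendered as page text
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # stripped_strings yields each non-empty text node, whitespace-trimmed
    return "\n".join(soup.stripped_strings)


# Made-up sample page standing in for the real page source
sample = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><p>Hello <b>world</b></p>
<footer>contact@example.com</footer>
<script>console.log("hidden")</script></body></html>
"""
print(visible_text(sample))
```

Removing script and style elements first matters because BeautifulSoup would otherwise include their contents as ordinary strings.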


python

data-mining

web-scraping