Web scraping Content Overflow

Question

I am trying to scrape a local site with BeautifulSoup in JupyterLab, but the site has only a single page with a very large amount of content. When I run this code:

import requests
from bs4 import BeautifulSoup
import re
import string

login_url = 'http://192.168.1.18/index.php?go=login'
login_success = 'http://192.168.1.18/cashier'

payload = {
    'is_submitted': 1,
    'username': 'admin',
    'password': 'admin',
    'submit': 'Submit',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64',
}

# Log in first so the session keeps the authentication cookies
s = requests.session()
r = s.post(login_url, data=payload)
soup = BeautifulSoup(r.content, 'html.parser')

# Fetch the page behind the login and print the whole parsed document
req = s.get(login_success, headers=headers)
soups = BeautifulSoup(req.content, 'html.parser')
print(soups.prettify())

It throws this error:

IOPub data rate exceeded. The Jupyter server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable --ServerApp.iopub_data_rate_limit. Current values: ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec) ServerApp.rate_limit_window=3.0 (secs)

I have already tried this, though; you can check it for more details.

Answer 1

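Note that this is not an error; your code runs fine. Jupyter is protecting your browser from crashing when too much content is displayed at once. The computation still runs underneath; only the printing is suppressed, for your benefit. Try printing only the first 1000 characters or so.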
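A minimal sketch of that suggestion follows. The helper name `preview` and the stand-in `html` string are assumptions made for illustration; in the question's code you would pass `soups.prettify()` instead.

```python
def preview(text, limit=1000):
    """Return at most `limit` characters of `text`, noting how much was cut."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}\n... [{omitted} more characters not shown]"


# Stand-in for soups.prettify() from the question, so the sketch runs offline.
html = "<p>row</p>\n" * 5000
print(preview(html))
```

If you need the whole document, writing it to a file with `open(...).write(...)` and opening that file in an editor avoids the notebook output channel altogether.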

As for the issue suggested in the comments: the setting does need adjusting for JupyterLab 3.0+. Note that it is now ServerApp rather than NotebookApp:

jupyter lab --ServerApp.iopub_data_rate_limit=1.0e10

Also, if you want to store the setting in a file, it should be jupyter_server_config.py rather than jupyter_notebook_config.py; you can generate it with:

jupyter server --generate-config

Then change the ServerApp.iopub_data_rate_limit traitlet, for example:

c.ServerApp.iopub_data_rate_limit = 1000000

There are other traitlets that may be of interest:

c.ServerApp.iopub_msg_rate_limit = 1000
c.ServerApp.rate_limit_window = 3
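To get a feel for how the data rate limit and the window interact, here is a rough sketch. It assumes the server averages the bytes sent during one window over the window length and compares that average with the limit; the function name `exceeds_rate_limit` is hypothetical, and the real server also counts message overhead, so treat the result as an estimate only.

```python
def exceeds_rate_limit(text, data_rate_limit=1_000_000.0, window=3.0):
    """Estimate whether printing `text` in one go would trip the limit.

    Assumption: the bytes sent within one window are averaged over `window`
    seconds and compared with `data_rate_limit` (bytes/sec).
    """
    size_in_bytes = len(text.encode("utf-8"))
    return size_in_bytes / window > data_rate_limit


small = "a" * 1_000
large = "a" * 4_000_000
print(exceeds_rate_limit(small))
print(exceeds_rate_limit(large))
```

Under this assumption, the values from the error message (1000000 bytes/sec over a 3 second window) allow roughly 3 MB of output per window before printing is suppressed.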
