BeautifulSoup: Why am I getting an internal server error?
I want to scrape the table on this page.
I wrote this code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sys
import requests
import pandas as pd
webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
page = urllib.request.urlopen(webpage)
soup = BeautifulSoup(page,'html.parser')
soup_text = soup.get_text()
print(soup)
It outputs this error:
Traceback (most recent call last):
File "scrape_cpad.py", line 9, in <module>
page = urllib.request.urlopen(webpage)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
I've tried this from two different computers and networks. Also, I can see that the server is up, because I can visit the page in a browser and view its HTML source.
I also tried changing the URL from https to http, and adding www.
Can someone show me working code that connects to this page and pulls down the table?
p.s. I've seen that there are similar questions, e.g. here and , but none of them answer my question.
soup = BeautifulSoup(page,'html.parser').context
It seems the server rejects requests that do not carry a proper User-Agent header.
I tried setting the User-Agent to my browser's, and I managed to get it to respond with the HTML page:
import urllib.request

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
req = urllib.request.Request(webpage)
# spoof the User-Agent header so the server accepts the request
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0')
page = urllib.request.urlopen(req)
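For completeness, here is a minimal sketch of feeding that response into BeautifulSoup, as the original code intended; the tr[data-toggle="modal"] row selector is borrowed from the answer below and assumes the table markup has not changed:

import urllib.request
from bs4 import BeautifulSoup

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
headers = {
    # spoofed browser User-Agent, as above
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
}
req = urllib.request.Request(webpage, headers=headers)

with urllib.request.urlopen(req) as page:
    soup = BeautifulSoup(page, 'html.parser')

# collect each data row's cell texts into a list of lists
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in soup.select('tr[data-toggle="modal"]')]
print(rows[:3])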
Alternatively, fetch the page with the requests module.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tr in soup.select('tr[data-toggle="modal"]'):
    print(tr.get_text(strip=True, separator=' '))
    print('-' * 120)
This prints:
P-0001 GYE 3 Amyloid Amyloid-beta precursor protein (APP) P05067 No Org Lett. 2008 Jul 3;10(13):2625-8. 18529009 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0002 KFFE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0003 KVVE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0004 NNQQ 4 Amyloid Eukaryotic peptide chain release factor GTP-binding subunit (ERF-3) P05453 Nature. 2007 May 24;447(7143):453-7. 17468747 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0005 VKSE 4 Non-amyloid Microtubule-associated protein tau (PHF-tau) P10636 Proc Natl Acad Sci U S A. 2000 May 9;97(10):5129-34. 10805776 AmyLoad
------------------------------------------------------------------------------------------------------------------------
P-0006 AILSS 5 Amyloid Islet amyloid polypeptide (Amylin) P10997 No Proc Natl Acad Sci U S A. 1990 Jul;87(13):5036-40. 2195544 CPAD
------------------------------------------------------------------------------------------------------------------------
...and so on.
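Since the question already imports pandas, here is a short sketch of pulling the same table into a DataFrame; this assumes pandas.read_html can parse the page's markup and that the peptide table is the first table element on the page:

import requests
import pandas as pd

url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
# browser-like User-Agent in case the server rejects the default one
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}

html = requests.get(url, headers=headers).text
# read_html returns one DataFrame per <table> found in the HTML
# (it needs lxml or html5lib installed); take the first one here
df = pd.read_html(html)[0]
print(df.head())

If the remaining pages are needed, the ?page=1 query parameter in the URL suggests the same request can be repeated with increasing page numbers and the resulting frames concatenated.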