Beautiful Soup extracts only the header of a table
I want to use Beautiful Soup with Python 3.5 to extract information from the table on the following website:
http://www.askapatient.com/viewrating.asp?drug=19839&name=ZOLOFT
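I have to save the web page first, because my program needs to run offline.
I saved the page on my computer and used the code below to extract the table information. The problem is that the code extracts only the table's header row.
Here is my code: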
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = "file:///Users/MD/Desktop/ZoloftPage01.html"
home_page= urlopen(url)
soup = BeautifulSoup(home_page, "html.parser")
table = soup.find("table", attrs={"class":"ratingsTable" } )
comments = [td.get_text() for td in table.findAll("td")]
print(comments)
Here is the output of the code:
['RATING', '\xa0 REASON', 'SIDE EFFECTS FOR ZOLOFT', 'COMMENTS', 'SEX', 'AGE', 'DURATION/DOSAGE', 'DATE ADDED ', '\xa0']
I need all of the information in the table rows.
Thanks for your help!
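This happens because the page's HTML is broken. You need to switch to a more lenient parser, such as html5lib (a separate package, installable with pip install html5lib). This worked for me: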
from pprint import pprint
import requests
from bs4 import BeautifulSoup
url = "http://www.askapatient.com/viewrating.asp?drug=19839&name=ZOLOFT"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
# HTML parsing part
soup = BeautifulSoup(response.content, "html5lib")
table = soup.find("table", attrs={"class":"ratingsTable"})
comments = [[td.get_text() for td in row.find_all("td")]
            for row in table.find_all("tr")]
pprint(comments)
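Since the question asks for an offline run, the same parser switch can be applied to the saved copy instead of the live page. The following is only a minimal sketch under that assumption: the helper names extract_rows and extract_rows_from_file are made up for illustration, and the file path is the one from the question. Whether the saved file contains the same broken markup as the live page depends on how the browser saved it.

```python
from bs4 import BeautifulSoup


def extract_rows(html, parser="html5lib"):
    """Return one list of cell texts per <tr> of the ratingsTable.

    `html` may be str or bytes. `parser` defaults to html5lib, the lenient
    parser suggested in the answer (requires `pip install html5lib`).
    """
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", attrs={"class": "ratingsTable"})
    if table is None:
        # No matching table: return an empty result instead of raising.
        return []
    return [[td.get_text(strip=True) for td in row.find_all("td")]
            for row in table.find_all("tr")]


def extract_rows_from_file(path, parser="html5lib"):
    """Hypothetical helper: parse a page that was saved to disk."""
    # Read bytes so Beautiful Soup can detect the encoding itself.
    with open(path, "rb") as f:
        return extract_rows(f.read(), parser)


if __name__ == "__main__":
    from pprint import pprint
    # Path taken from the question; adjust it to your own saved copy.
    pprint(extract_rows_from_file("/Users/MD/Desktop/ZoloftPage01.html"))
```

Passing parser="html.parser" to the same function reproduces the original behaviour, which makes it easy to compare what the two parsers return for the same file.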