Scraping with Beautiful Soup raises an error only for certain teams (NoneType object encountered)
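I am trying to get the injury list for a particular team (Liverpool, in this case) from the following page:
http://www.physioroom.com/news/english_premier_league/epl_injury_table.php
It works for some teams (Swansea) but fails for others (Liverpool, Everton) with the following error:
TypeError: Can't convert 'NoneType' object to str implicitly
This is the code I am using.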
from bs4 import BeautifulSoup
import urllib.request

url = "http://www.physioroom.com/news/english_premier_league/epl_injury_table.php"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
#lp = soup.find(alt="Liverpool away shirt").parent.parent.parent.next_sibling.next_sibling
lp = soup.find(alt="Swansea City away shirt").parent.parent.parent.next_sibling.next_sibling
player_info = ""
player_list = []
while True:
    if(lp.has_attr('id')):
        break
    else:
        tdlist = lp.find_all('td')
        # player_info = tdlist[0].string+"\t"+tdlist[1].string+"\t"+tdlist[3].string
        #print(tdlist[0].find('a').string.strip() + "\t" + tdlist[1].string.strip() + "\t" + tdlist[3].string.strip())
        print(tdlist[0].string + "\t" + tdlist[1].string + "\t" + tdlist[3].string)
    lp=lp.findNext('tr')
Please tell me how to fix this.
from bs4 import BeautifulSoup
import requests

url = "http://www.physioroom.com/news/english_premier_league/epl_injury_table.php"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
table = soup.find('table', id='epl-table')
for tr in table('tr', id=None):
    print(tr.get_text('\t', strip=True))
Output:
PLAYER CONDITION LATEST NEWS EXPECTED RETURN AVAILABLE?
D Meyler Knock No Return Date Slight Doubt
S Maloney Ear Infection No Return Date Slight Doubt
M Henriksen Shoulder Separation April 1, 2017 Major Doubt
A McGregor Fitness No Return Date Major Doubt
W Keane ACL Knee Injury No Return Date
M Odubajo Patella Fracture May 1, 2017
G Luer Knee Injury February 1, 2017
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
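A likely cause of the TypeError in the question is that `.string` returns None whenever a tag has more than one child, so concatenating it with "\t" fails for those rows. The sketch below contrasts `.string` with `get_text()`; the HTML is made up for illustration and is only an assumption about what a problem row on the real page might look like.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the first cell holds a link plus extra text (two children),
# the second cell holds a single text node.
html = "<table><tr><td><a href='#'>D Meyler</a> (c)</td><td>Knock</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

first_string = cells[0].string     # None, because the cell has more than one child
second_string = cells[1].string    # the single text child of the cell
first_text = cells[0].get_text()   # all text beneath the tag, joined into one string

print(repr(first_string), repr(second_string), repr(first_text))
```

Since `get_text()` always returns a string, it can be concatenated safely even when the cell contains nested tags.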
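You can specify a string to be used to join the bits of text together.
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text.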
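To get the rows for a single team, as the question asks, the separator and strip options can be combined with the observation from the question's own loop that team header rows carry an id while player rows do not. This is only a sketch under that assumption: the HTML, the ids `team-a` and `team-b`, and the helper `rows_for_team` are all hypothetical and would need adjusting to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: team header rows have an id, player rows do not.
html = """
<table id="epl-table">
  <tr id="team-a"><td>Team A</td></tr>
  <tr><td> <a href="#">D Meyler</a> </td><td> Knock </td></tr>
  <tr id="team-b"><td>Team B</td></tr>
  <tr><td> <a href="#">G Luer</a> </td><td> Knee Injury </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="epl-table")

def rows_for_team(table, team_row_id):
    """Return tab-joined text of the player rows that follow one team header row."""
    header = table.find("tr", id=team_row_id)
    if header is None:          # unknown team id: nothing to collect
        return []
    rows = []
    for tr in header.find_next_siblings("tr"):
        if tr.has_attr("id"):   # reached the next team's header row
            break
        # Join the cell texts with a tab and strip whitespace around each piece.
        rows.append(tr.get_text("\t", strip=True))
    return rows

team_a_rows = rows_for_team(table, "team-a")
print(team_a_rows)
```

Stopping at the next row that has an id mirrors the `has_attr('id')` check in the question, but avoids the chain of `.parent` and `.next_sibling` calls that depends on the exact whitespace in the page.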