BeautifulSoup4 missing elements?
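Summary: BS4 is not extracting the contents of some td elements; it returns None instead of the text they contain. I don't understand why.
Details: I am trying to scrape an HTML table using BS4 (code below). The table has three columns, like this: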
<tr>
<td>From Number</td>
<td>Time</td>
<td class="span10" style="word-wrap: break-word;">Message</td>
</tr>
(These are actually the column headers; included for context.)
Normally the function below parses each row as:
[u'From Number', u' ', u'Time', u' ', u'Message']
But sometimes the last element comes out as None:
[u'From Number', u' ', u'Time', u' ', None]
I thought the <br /> tags, and then the newlines, were causing the problem, but it persists even with both stripped out.
def grab_smss(soup):                    # soup = the web page, parsed after applying
    """                                 # html_doc = html_doc.replace("\n", "")
    Extracts SMSs from page, in form [From, Ago, Msg]
    """
    sms_list = []
    in_smss = False
    brs = soup.findAll(name="br")       # Removes <br /> tags; looked
    [br.extract() for br in brs]        # like these were the problem
    for row in soup.body.table.find_all('tr'):
        sms_row = [unicode(child.string) for child in row.children]
        sms_list.append(sms_row)
        if "From Number" in sms_row:
            in_smss = True
    return sms_list
Here are some sample problem rows (verbatim, before stripping the br tags and \n), along with the function's result for each:
<tr><td>1562375XXXX</td><td>2 minutes ago</td><td class="span10" style="word-wrap: break-word;">1234567: hi honney, trust trying how to use globfone. glad u told me about this site. it will be<br />
useful to me in the future. /check globfone.com<br /></td></tr>
Gives: [u'1562375XXXX', u'26 minutes ago', u'None']
<tr><td>1360234XXXX</td><td>2 hours ago</td><td class="span10" style="word-wrap: break-word;">Your code is: 1083 Enter this code to verify your mobile phone number. The code is valid for 24<br />
hours.</td></tr>
Gives: [u'1360234XXXX', u'3 hours ago', u'None']
What could be causing this problem?
Try this:
from bs4 import BeautifulSoup

data = '''<html><body><table><tr><td>1562375XXXX</td><td>2 minutes ago</td><td class="span10" style="word-wrap: break-word;">1234567: hi honney, trust trying how to use globfone. glad u told me about this site. it will be<br />
useful to me in the future. /check globfone.com<br /></td></tr></table></body></html>'''

def grab_smss(soup):                    # soup = the web page, parsed after applying
    """                                 # html_doc = html_doc.replace("\n", "")
    Extracts SMSs from page, in form [From, Ago, Msg]
    """
    sms_list = []
    in_smss = False
    [s.extract() for s in soup('br')]   # drop every <br /> tag
    for row in soup.body.table.find_all('tr'):
        # Join the strings of each cell's sub-nodes instead of relying on child.string
        sms_row = [' '.join(unicode(subchild.string) for subchild in child) for child in row.children]
        sms_list.append(sms_row)
        if "From Number" in sms_row:
            in_smss = True
    return sms_list

print grab_smss(BeautifulSoup(data))
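Because of the <br> tags (even after they are removed), the content of the third td is a collection of separate nodes rather than a single string, and .string only returns a value when a tag has exactly one child, so child.string returns None. If you iterate over the sub-nodes and join them into one string, it works.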
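To make the cause easier to see, here is a minimal Python 3 sketch on a shortened version of the sample row. It assumes the built-in "html.parser" backend, and row_texts is a hypothetical helper name that does not appear in the original code. It also uses get_text() in place of the manual join above, which should give the same result for rows like these, though it is a swapped-in technique and has not been checked against the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one of the problem rows (not the full original message).
html = (
    "<table><tr>"
    "<td>1562375XXXX</td>"
    "<td>2 minutes ago</td>"
    "<td>it will be<br />\nuseful to me in the future.<br /></td>"
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
cells = soup.find_all("td")

# A tag with exactly one child string exposes it through .string.
single = cells[0].string

# The third cell holds text, <br/>, text, <br/>: several children, so .string is None.
before = cells[2].string

# Removing the <br> tags still leaves two separate text nodes behind.
for br in cells[2].find_all("br"):
    br.extract()
after = cells[2].string
remaining_children = len(cells[2].contents)


def row_texts(row):
    """Return the whitespace-normalized text of each <td> in a <tr> (hypothetical helper)."""
    return [" ".join(td.get_text().split()) for td in row.find_all("td")]


texts = row_texts(soup.find("tr"))
print(single, before, after, remaining_children)
print(texts)
```

Selecting cells with find_all("td") rather than iterating row.children also skips any whitespace-only text nodes between cells, which is presumably where the u' ' entries in the header row output come from.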