BeautifulSoup4 missing elements?

Summary: BS4 is not extracting the content of some td elements, returning None instead of the text they contain. I can't figure out why.

Details: I'm trying to scrape an HTML table using BS4 (code below). The table has three columns, like this:

<tr>
  <td>From Number</td>
  <td>Time</td>
  <td class="span10" style="word-wrap: break-word;">Message</td>
</tr>

(These are actually the column headers; included for context.)

Normally the function below parses each row as:

[u'From Number', u' ', u'Time', u' ', u'Message']

But sometimes the last element comes out as None:

[u'From Number', u' ', u'Time', u' ', None]

I thought the <br /> tags, and then the newlines, were causing the problem, but it persists with both stripped out.

def grab_smss(soup):   # soup = the web page, parsed after applying
    """                # html_doc = html_doc.replace("\n", "")
    Extracts SMSs from page, in form [From, Ago, Msg]
    """
    sms_list = []
    in_smss = False
    brs = soup.findAll(name="br")   # Removes <br /> tags; looked 
    [br.extract() for br in brs]    # like these were the problem
    for row in soup.body.table.find_all('tr'):
        sms_row = [unicode(child.string) for child in row.children]
        sms_list.append(sms_row)
        if "From Number" in sms_row:
            in_smss = True
    return sms_list

Here are some example problem rows (verbatim, before stripping the br tags and \n), together with the function's result for each:

<tr><td>1562375XXXX</td><td>2 minutes ago</td><td class="span10" style="word-wrap: break-word;">1234567: hi honney, trust trying how to use globfone. glad u told me about this site. it will be<br />
useful to me in the future. /check globfone.com<br /></td></tr>

Gives: [u'1562375XXXX', u'26 minutes ago', u'None']

<tr><td>1360234XXXX</td><td>2 hours ago</td><td class="span10" style="word-wrap: break-word;">Your code is: 1083 Enter this code to verify your mobile phone number. The code is valid for 24<br />
hours.</td></tr>

Gives: [u'1360234XXXX', u'3 hours ago', u'None']

What might be causing this?

Try this:

from bs4 import BeautifulSoup

data = '''<html><body><table><tr><td>1562375XXXX</td><td>2 minutes ago</td><td class="span10" style="word-wrap: break-word;">1234567: hi honney, trust trying how to use globfone. glad u told me about this site. it will be<br />
useful to me in the future. /check globfone.com<br /></td></tr></table></body></html>'''

def grab_smss(soup):   # soup = the web page, parsed after applying
    """                # html_doc = html_doc.replace("\n", "")
    Extracts SMSs from page, in form [From, Ago, Msg]
    """
    sms_list = []
    in_smss = False
    [s.extract() for s in soup('br')]
    for row in soup.body.table.find_all('tr'):
        sms_row = [' '.join(unicode(subchild.string) for subchild in child) for child in row.children]
        sms_list.append(sms_row)
        if "From Number" in sms_row:
            in_smss = True
    return sms_list


print grab_smss(BeautifulSoup(data))

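Because of the <br> tags (even after they are removed), the third cell's content is a collection of child nodes rather than a single string, so child.string returns None. If you iterate over those children and join them into one string, it works.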
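
The behaviour described above can be checked in a few lines. This is only a minimal sketch, assuming Python 3 with bs4 installed and the built-in "html.parser" backend; the sample row is abbreviated from the question, and row_texts is a hypothetical helper name. It also swaps the manual join for get_text(), which is an alternative to the approach in the answer, not the same code.

```python
from bs4 import BeautifulSoup

# Abbreviated sample row (assumption: shortened from the question's data)
html = (
    "<table><tr><td>1360234XXXX</td><td>2 hours ago</td>"
    '<td class="span10">Your code is: 1083. The code is valid for 24<br />'
    "hours.</td></tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string is only defined when a tag has exactly one child
single = cells[0].string   # one text node, so this is the text
multi = cells[2].string    # text, <br/>, text: more than one child, so None

# Extracting the <br/> leaves two separate text nodes behind,
# so .string is still None afterwards
for br in soup.find_all("br"):
    br.extract()
after_extract = cells[2].string


def row_texts(row):
    """Hypothetical helper: text of every <td> in a row, children joined."""
    return [td.get_text(" ", strip=True) for td in row.find_all("td")]


rows = [row_texts(tr) for tr in soup.find_all("tr")]
print(single, multi, after_extract)
print(rows)
```

get_text() walks all descendant strings of the cell, so it does not matter how many pieces the <br> tags split the message into, and the tags do not need to be removed first.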