Python - Beautiful Soup not parsing entire unordered list

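I am trying to scrape a website, but one part has me stumped. There is an unordered list of the locations an organization serves, and I can't seem to parse the entire list.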

Here is a sample of the HTML:

<div id="current_tab">

                <p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
                <ul>
                    <li class="view_type_geoserved" id="view_field_geoserved">
                        <p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
                    </li>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>
                    </li>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Granville (serves entire county)<span style="float: right; font-size: 0.8em;">Granville</span>
                        </p>
                    </li>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Orange (serves entire county)<span style="float: right; font-size: 0.8em;">Orange</span></p>
                    </li>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Person (serves entire county)<span style="float: right; font-size: 0.8em;">Person</span></p>
                    </li>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Vance (serves entire county)<span style="float: right; font-size: 0.8em;">Vance</span></p>
                    </li>
                        <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Wake (serves entire county)<span style="float: right; font-size: 0.8em;">Wake</span></p>
                    </li>
                    <p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Warren (serves entire county)<span style="float: right; font-size: 0.8em;">Warren</span></p>
                    </li>
            </ul>            
</div>

Here is what I am using to parse the elements:

for i in soup.find('div', {'id': 'current_tab'}).find_all('p'):
    print(i)

Here is the result I get; note that it is only the beginning of the list:

<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>

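Once I get the HTML back, I have some functions that strip out the text with regular expressions and then concatenate it into one string, but help with that part would be appreciated too.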

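The problem is that the HTML you are dealing with is malformed (note the stray </li> closing tags that have no matching <li>), so it needs a lenient parser.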

Use lxml or html5lib:

from bs4 import BeautifulSoup

# data holds the page's HTML source
soup = BeautifulSoup(data, 'html5lib')  # or BeautifulSoup(data, 'lxml')
for p in soup.select('div#current_tab p'):
    print(p.text)

This works for me; it prints:

Geographies Served
North Carolina (NC)North Carolina (NC)
Durham (serves entire county)Durham
Franklin (serves entire county)Franklin
Granville (serves entire county)Granville

Orange (serves entire county)Orange
Person (serves entire county)Person
Vance (serves entire county)Vance
Wake (serves entire county)Wake
Warren (serves entire county)Warren
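
If you want to check which parser copes with this markup on your own setup, and clean up the text without regular expressions at the same time, something like the following may help. It is only a sketch: the HTML constant is a trimmed-down stand-in for the markup in the question, and paragraphs_by_parser is a hypothetical helper name, not part of Beautiful Soup. How many paragraphs each parser recovers may depend on the parser and on the Beautiful Soup version installed, so no expected output is shown.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Trimmed-down stand-in for the markup in the question; note the stray </li> tags.
HTML = """
<div id="current_tab">
  <p>Geographies Served</p>
  <ul>
    <li>
      <p>Durham (serves entire county)<span>Durham</span></p>
    </li>
      <p>Franklin (serves entire county)<span>Franklin</span></p>
    </li>
      <p>Granville (serves entire county)<span>Granville</span>
      </p>
    </li>
  </ul>
</div>
"""


def paragraphs_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: [cleaned <p> text, ...]} for each installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser library is not installed; skip it
        results[name] = [
            # strip=True trims every text node and drops empty ones;
            # " " keeps the label and the <span> text from running together
            p.get_text(" ", strip=True)
            for p in soup.select("div#current_tab p")
        ]
    return results


# Print one line per parser: its name, the paragraph count, and the joined text.
for name, texts in paragraphs_by_parser(HTML).items():
    print(name, len(texts), "; ".join(texts))
```

get_text(" ", strip=True) also takes care of the trailing newline inside the Granville paragraph, which is what produces the blank line in the output above.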