How to iterate over unnested HTML to extract contents in a list format in BeautifulSoup

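My question is fairly specific. I have been looking through all the other BeautifulSoup questions on SO, but I haven't found an answer to mine. I've taken a PDF file and converted it into fairly decent HTML, with the goal of transcribing it further into a CSV file.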

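The web page I'm working with looks like this, except that I've edited out a bunch of things I'm not sure I want the average Google user to see: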

(RUSI) US Foundation
Last Updated: 2014-12-29
At A Glance
[st # redacted] I St. N.W.
Washington, DC United States 20006
Type of Grantmaker
Independent foundation
Financial Data
(yr. ended 2013-12-31)
Assets: ,085 Total giving: $0
EIN
[redacted]
990
[redacted]
Application Information
Unsolicited requests for funds not accepted.
Application form not required.
Directors Michael Clarke Sean Murphy Timothy Voake
Financial Data
Year ended 2013-12-31
Assets: ,085 (market value)
Expenditures: 7
Total giving: $0
Qualifying distributions: 7
Additional Location Information
County: District of Columbia
Metropolitan area: Washington-Arlington-Alexandria, DC-VA-MD-WV Congressional district: District of Columbia District At-large

04Arts Foundation
Last Updated: 2013-05-15
At A Glance
P.O. Box [redacted]
San Antonio, TX United States 78283-1253 Telephone:(210) [redacted] Contact: Penelope Speier URL: www.04arts.org
Type of Grantmaker
Independent foundation
Financial Data
(yr. ended 2012-12-31)
Assets: ,957 Total giving: ,698
EIN
[redacted]
990
[redacted]
Additional Contact Information
Application Address: [redacted] Dallas, New Braunfels, TX 78130
Background
Established in 1995 in TX.
Limitations
No grants to individuals.
Fields of Interest Subjects
Arts
Application Information
Application form not required.
Initial approach: Proposal Deadline(s): None
Donor(s)
Note: If a donor is deceased, the symbol (f) follows the name.
Penelope Gallagher William Gallagher Edward Everett Collins, III Edwards Aquifer Authority
Officer
Penelope Speier, Pres.
Directors Wendy W. Atwell Jon Cochran
Financial Data
Year ended 2012-12-31
Assets: ,957 (market value)
Gifts received: $[redacted] Expenditures: $[redacted] Total giving: $[redacted] Qualifying distributions: $[redacted] Giving activities include:
$[redacted] for grants
Additional Location Information
County: Bexar
Metropolitan area: San Antonio, TX Congressional district: Texas District 35

1 in 9: The Long Island Breast Cancer Action Coalition, Inc
Last Updated: 2011-12-19
At A Glance
[redacted] E. Rockaway Rd.
Hewlett, NY United States 11557-1736 Telephone:(516) [redacted] Fax: (516) [redacted] E-mail: [redacted]
Type of Grantmaker
Public charity
Additional Descriptor
Organization that normally receives a substantial part of its support from a governmental unit or from the general public
EIN
[redacted]
990
[redacted]
Purpose and Activities
The coalition's mission is to promote awareness of the breast cancer epidemic through education, outreach, advocacy, and direct support of research which is being done to find the causes of and cures for breast cancer and other related cancers.
Fields of Interest Subjects
Breast cancer
Breast cancer research
Cancer
Cancer research
Types of Support
Research
Publications
Newsletter
Officers and Directors
Note: An asterisk (*) following an individual's name indicates an officer who is also a trustee or director.
Geri Barish *, Pres.
Louise Levrie, V.P.
Larry Slatky *, Treas.
Caroline Boss Fran Kritchek Frank P. Naudus Leon Newman
Additional Location Information
County: Nassau
Metropolitan area: New York-Northern New Jersey-Long Island, NY-NJ-PA Congressional district: New York District 04

My HTML currently looks like this (this is exactly how it came out, so be warned, it's horrible):

<p style="text-align:justify;"><span class="font7" style="color:#CB4810;">FOUNDATION</span></p><a name="caption1"></a><h1 style="text-align:justify;"><a name="bookmark0"></a><span class="font7" style="color:#CB4810;"><a href="https://fconline.foundationcenter.org/">DIRECTORY</a></span></h1><div style="float:right;layout-flow:horizontal;">
<p><span class="font4"><a href="https://fconline.foundationcenter.org/grantmaker-profile/save?html_id=54c1468ec37a7">Save this Page</a></span></p></div>
<p style="text-align:justify;"><span class="font1" style="color:#ED977A;">ONLINE </span><span class="font1" style="color:#9D9D9D;">.*&gt;. </span><span class="font1" style="font-weight:bold;color:#9D9D9D;">A </span><span class="font1" style="color:#9D9D9D;">service of the &nbsp;&nbsp;&nbsp;</span><span class="font1" style="color:#808080;">_ </span><span class="font1">...... _</span></p>
<p style="text-align:right;padding:0pt 0pt 23pt 0pt;"><span class="font4" style="text-decoration:underline;">Print this Page</span></p>
<p style="text-align:justify;padding:23pt 0pt 9pt 0pt;"><span class="font4">(</span><span class="font4" style="font-weight:bold;">Refinements: </span><span class="font4">Grantmaker Name: *)</span></p><h2 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark1"></a><span class="font6" style="font-weight:bold;">(RUSI) US Foundation</span></h2>
<p style="text-align:justify;padding:0pt 0pt 14pt 0pt;"><span class="font1" style="font-weight:bold;">Last Updated: </span><span class="font2">2014</span><span class="font0">-</span><span class="font2">12-29</span></p><h3 style="text-align:justify;padding:14pt 0pt 0pt 0pt;"><a name="bookmark2"></a><span class="font5" style="font-weight:bold;">At A Glance</span></h3>
<p style="text-align:justify;"><span class="font4">1776 I St. N.W.</span></p>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Washington, DC United States 20006</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark3"></a><span class="font4" style="font-weight:bold;">Type of Grantmaker</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Independent foundation</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark4"></a><span class="font4" style="font-weight:bold;">Financial Data</span></h4>
<p style="text-align:justify;"><span class="font4">(yr. ended 2013-12-31)</span></p>
<p style="padding:0pt 421pt 9pt 0pt;"><span class="font4">Assets: ,085 Total giving: $0</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark5"></a><span class="font4" style="font-weight:bold;">EIN</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">721374719</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark6"></a><span class="font4" style="font-weight:bold;">990</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4"><a href="http://990s.foundationcenter.org/990pf_pdf_archive/721/721374719/721374719_201312_990PF.pdf">2013 </a><a href="http://990s.foundationcenter.org/990pf_pdf_archive/721/721374719/721374719_200412_990PF.pdf">2004</a><a href="http://990s.foundationcenter.org/990_pdf_archive/721/721374719/721374719_200312_990EZ.pdf"> 2003 </a><a href="http://990s.foundationcenter.org/990pf_pdf_archive/721/721374719/721374719_200212_990PF.pdf">2002</a></span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark7"></a><span class="font4" style="font-weight:bold;">Application Information</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Unsolicited requests for funds not accepted.</span></p>
<p style="text-align:justify;padding:9pt 0pt 14pt 0pt;"><span class="font4">Application form not required.</span></p>
<p style="padding:14pt 421pt 14pt 0pt;"><span class="font4" style="font-weight:bold;">Directors Michael Clarke&nbsp;Sean Murphy&nbsp;Timothy Voake</span></p><h4 style="text-align:justify;padding:14pt 0pt 0pt 0pt;"><a name="bookmark8"></a><span class="font4" style="font-weight:bold;">Financial Data</span></h4>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4" style="font-weight:bold;">Year ended 2013-12-31</span></p>
<p style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><span class="font4">Assets: ,085 (market value)</span></p>
<p style="text-align:justify;"><span class="font4">Expenditures: 7</span></p>
<p style="text-align:justify;"><span class="font4">Total giving: $0</span></p>
<p style="text-align:justify;padding:0pt 0pt 9pt 0pt;"><span class="font4">Qualifying distributions: 7</span></p><h4 style="text-align:justify;padding:9pt 0pt 0pt 0pt;"><a name="bookmark9"></a><span class="font4" style="font-weight:bold;">Additional Location Information</span></h4>
<p style="text-align:justify;"><span class="font4">County: District of Columbia</span></p>

Now, when I run this code using BS:

from bs4 import BeautifulSoup as Soup

# Name the parser explicitly and close the file once it has been read.
with open('found1.html') as f:
    html = Soup(f, 'html.parser')

# Every organization title is an <h2> with this exact style attribute.
titles = html.find_all('h2', style="text-align:justify;padding:9pt 0pt 0pt 0pt;")
print(titles[0].find(string=True))
print(titles[0].find_next('p', style="text-align:justify;padding:0pt 0pt 14pt 0pt;")
      .find_all(string=True))
print(titles[0].find_next('span', class_="font5",
                          style="font-weight:bold;").find(string=True))

I get:

(RUSI) US Foundation
['Last Updated: ', '2014', '-', '12-29']
At A Glance

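Great! The next part is where I'm stuck. I need to grab everything between 'At A Glance' and 'Type of Grantmaker'. Then I need to do the same for 'Type of Grantmaker' and the next group. One nice thing here is that the tags for similar headings are almost always the same. For example, that is how I got the names of all the titles with the titles = html.... code.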

The output I want is a list that looks like this:

[[first organization, last_updated, at_a_glance, type_of_grantmaker, financial_data, ...], 
[second organization, ...], [third organization, ...], ...]

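Any step in the right direction would be greatly appreciated! If you think my question is bad for some reason, I'd appreciate a comment along with the -1 so that I can fix it. I'm new here, and my last question didn't get a great response...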

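It turned out that the easiest approach for me was to split the file before handing it to BeautifulSoup. So I split it with the code below, and (for now) I'm writing a function that breaks the text down cleanly.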

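from bs4 import BeautifulSoup as Soup

with open('found1.html', 'r') as f:
    html = f.read()
# Each organization title sits in a font6 span, so splitting on that marker
# gives one chunk per organization (sections[0] is the page header).
sections = html.split('</a><span class="font6" style="font-weight:bold;">')


# Developing this bit to extract text cleanly.
def extract(fragment):
    soup = Soup(fragment, 'html.parser')
    # Collect every non-empty text node in document order.
    return [text.strip() for text in soup.find_all(string=True) if text.strip()]


# Gives me the whole html between the first title and the second
print(sections[1])
print(extract(sections[1]))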
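
For comparison, here is a rough sketch of a different route that skips the string split and walks the flat markup directly. It is not what I ran above. It assumes that each organization starts at an h2, that h3/h4 tags mark the sub-headings, and that all of these tags are siblings of one another, as they are in the excerpt in the question. SAMPLE and parse_organizations are made-up names for illustration, and SAMPLE is a trimmed stand-in for found1.html with the style attributes dropped.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for found1.html (same flat layout).
SAMPLE = """
<h2><a name="bookmark1"></a><span class="font6">(RUSI) US Foundation</span></h2>
<p><span class="font1">Last Updated: </span><span class="font2">2014</span><span class="font0">-</span><span class="font2">12-29</span></p>
<h3><a name="bookmark2"></a><span class="font5">At A Glance</span></h3>
<p><span class="font4">1776 I St. N.W.</span></p>
<p><span class="font4">Washington, DC United States 20006</span></p>
<h4><a name="bookmark3"></a><span class="font4">Type of Grantmaker</span></h4>
<p><span class="font4">Independent foundation</span></p>
<h2><a name="bookmark10"></a><span class="font6">04Arts Foundation</span></h2>
<p><span class="font1">Last Updated: </span><span class="font2">2013-05-15</span></p>
<h3><a name="bookmark11"></a><span class="font5">At A Glance</span></h3>
<p><span class="font4">P.O. Box [redacted]</span></p>
"""


def clean(tag):
    """Return the tag's text with runs of whitespace collapsed to one space."""
    return " ".join(tag.get_text().split())


def parse_organizations(html):
    """Return one list per organization: [name, field, field, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    organizations = []
    for title in soup.find_all("h2"):
        record = [clean(title)]
        current = None  # the [heading, text, ...] group being filled
        # With no arguments, find_next_siblings() yields only the following tags.
        for node in title.find_next_siblings():
            if node.name == "h2":            # the next organization starts here
                break
            if node.name in ("h3", "h4"):    # a sub-heading opens a new group
                current = [clean(node)]
                record.append(current)
            elif node.name == "p":
                if current is None:          # text before any sub-heading
                    record.append(clean(node))
                else:
                    current.append(clean(node))
        organizations.append(record)
    return organizations


if __name__ == "__main__":
    for organization in parse_organizations(SAMPLE):
        print(organization)
```

Each record keeps the organization name first, then the 'Last Updated' line, then one [heading, text, ...] group per sub-heading, which is close to the list-of-lists shape asked for in the question. If the real file nests these tags inside other elements, the sibling walk would need adjusting.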