Python + selenium: extract variable quantity of paragraphs between titles

Question

Hi all, given the HTML below, how can I extract the <p> paragraphs that belong to each <h3>?

<!DOCTYPE html>
    <html>
    <body>
    ...
        <div class="main-div">
            <h3>Title 1</h3>
            <p></p>
        
            <h3>Title 2</h3>
            <p></p>
            <p></p>
            <p></p>
            
            <h3>Title 3</h3>
            <p></p>
            <p></p>
            ...
        </div>
</body>

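As you can see, the <h3> and <p> tags are all children of the <div> tag, but they have no class or id that would identify them and tell me that "Title 1" has 1 paragraph, "Title 2" has 3 paragraphs, "Title 3" has 2 paragraphs, and so on. I can't find a way to link the paragraphs to their title...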

I am trying to do this with Python 2.7 + selenium, but I'm not sure I'm using the right tools. Maybe you can suggest a solution or a different combination, such as BeautifulSoup, urllib2...

Any suggestion/direction would be greatly appreciated!

Update

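Following the brilliant solution pointed out by @JustMe, I came up with the code below. I hope it helps someone else, or perhaps someone can improve it to be more pythonic. I come from the c/c++/java/perl world, so I keep hitting walls :)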

import bs4

page = """ 
<!DOCTYPE html>
<html>
<body>
...
    <div class="maincontent-block">
        <h3>Title 1</h3>
        <p>1</p>
        <p>2</p>
        <p>3</p>

        <h3>Title 2</h3>
        <p>2</p>
        <p>3</p>
        <p>4</p>

        <h3>Title 3</h3>
        <p>7</p>
        <p>9</p>
        ...
    </div>
</body>
"""

page = bs4.BeautifulSoup(page, "html.parser")
div = page.find('div', {'class':"maincontent-block"})

mydict = {}

# write to the dictionary: map each <h3> title to the text of the <p> tags after it
for tag in div.find_all("h3"):
    arr = []
    for nt in tag.find_all_next():
        if nt.name == "p":
            arr.append(nt.string)
        elif nt.name == "h3":
            # the next title starts a new group
            break
    # a title with no paragraphs maps to an empty list rather than None
    mydict[tag.string] = arr

# read from dictionary, in sorted title order
for k in sorted(mydict):
    print(k)
    for v in mydict[k]:
        print(v)
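
A more compact variant of the same idea, offered only as a sketch: walk the direct children of the div once and append each <p> to the list of the most recent <h3>. The helper name group_paragraphs and the SAMPLE markup are made up for illustration; it assumes every heading and paragraph is a direct child of the div, as in the HTML above.

```python
import bs4

# Illustrative markup with the same shape as the page above.
SAMPLE = """
<div class="maincontent-block">
    <h3>Title 1</h3>
    <p>1</p>
    <h3>Title 2</h3>
    <p>2</p>
    <p>3</p>
    <p>4</p>
    <h3>Title 3</h3>
    <p>7</p>
    <p>9</p>
</div>
"""


def group_paragraphs(div):
    """Map each <h3> title to the texts of the <p> tags that follow it."""
    groups = {}
    current = None  # paragraph list of the most recent <h3>, if any
    # recursive=False visits only the direct child tags of the div
    for child in div.find_all(recursive=False):
        if child.name == "h3":
            current = groups.setdefault(child.get_text(strip=True), [])
        elif child.name == "p" and current is not None:
            current.append(child.get_text(strip=True))
    return groups


soup = bs4.BeautifulSoup(SAMPLE, "html.parser")
groups = group_paragraphs(soup.find("div", class_="maincontent-block"))
for title in sorted(groups):
    print(title, groups[title])
```

Paragraphs that appear before the first <h3> are skipped here. Whether that is the right choice depends on the real page.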

Answer 1

This is easy to do with BeautifulSoup:

import bs4

page = """
<!DOCTYPE html>
    <html>
    <body>
    ...
        <div class="main-div">
            <h3>Title 1</h3>
            <p></p>

            <h3>Title 2</h3>
            <p></p>
            <p></p>
            <p></p>

            <h3>Title 3</h3>
            <p></p>
            <p></p>
            ...
        </div>
</body>
"""

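page = bs4.BeautifulSoup(page, "html.parser")
h3_tag = page.div.find("h3")  # keep the Tag itself; .string is only its text
print(h3_tag.string)
>>> Title 1

# every <p> sibling after the first <h3>, including those under later titles
h3_tag.find_next_siblings("p")
>>> [<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>]
len(h3_tag.find_next_siblings("p"))
>>> 6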

OK, since you want the paragraphs counted separately for each title, I came up with this. It's crude, but it works.

h_counters = []
count = -1
for child in page.div.findChildren():
    if child.name == "h3":
        # a new title closes the previous group
        h_counters.append(count)
        count = 0
    else:
        count += 1
h_counters.append(count)
# drop the placeholder recorded before the first title
h_counters = h_counters[1:]
print(h_counters)
>>> [1, 3, 2]
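
If you would rather collect the paragraphs per title than just count them, one possible sketch is to take the siblings after each <h3> and stop at the next <h3>. The helper name paragraphs_under and the SAMPLE markup are made up for illustration; this assumes titles and paragraphs are siblings, as in the question.

```python
from itertools import takewhile

import bs4

# Illustrative markup with the same shape as the question's HTML.
SAMPLE = """
<div class="main-div">
    <h3>Title 1</h3>
    <p></p>
    <h3>Title 2</h3>
    <p></p><p></p><p></p>
    <h3>Title 3</h3>
    <p></p><p></p>
</div>
"""


def paragraphs_under(h3):
    """Return the <p> siblings after h3, stopping at the next <h3>."""
    following = h3.find_next_siblings()  # tag siblings only, in document order
    section = takewhile(lambda tag: tag.name != "h3", following)
    return [tag for tag in section if tag.name == "p"]


soup = bs4.BeautifulSoup(SAMPLE, "html.parser")
counts = [(h3.get_text(), len(paragraphs_under(h3))) for h3 in soup.find_all("h3")]
print(counts)
```

Each call to paragraphs_under returns the tags themselves, so you can read their text or attributes afterwards instead of only their number.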
