How to parse information from the same class using BeautifulSoup?

Suppose I have the following HTML:

html_doc = """

    <html>
    <head>
    <title>Page Title</title>
    </head>
    <body>
    
    <div class = "Box1">
      <span class = "catagory">Plant</span>
        <div class = "Box2">
          <span class = "sub-catagory">Trees</span>
            <div class = "characters">
              <div class = "font-medium">1.2</div>
              <div class = "font-medium">1.6</div>
              <div class = "font-medium">1.7</div>
              <div class = "font-medium">1.8</div>
              <div class = "font-medium">1.9</div>
              <div class = "font-medium">1.4</div>
            </div>
          <span class = "sub-catagory">Flowers</span>
            <div class = "characters">
              <div class = "font-medium">2.2</div>
              <div class = "font-medium">3.6</div>
              <div class = "font-medium">4.7</div>
              <div class = "font-medium">5.8</div>
              <div class = "font-medium">6.9</div>
              <div class = "font-medium">7.4</div>
            </div>
          </div>
      <span class = "catagory">animals</span>
        <div class = "Box2">
          <span class = "sub-catagory">human</span>
            <div class = "characters">
              <div class = "font-medium">7.2</div>
              <div class = "font-medium">9.6</div>
              <div class = "font-medium">4.7</div>
              <div class = "font-medium">3.8</div>
              <div class = "font-medium">6.9</div>
              <div class = "font-medium">9.4</div>
            </div>
          <span class = "sub-catagory">dog</span>
            <div class = "characters">
              <div class = "font-medium">4.2</div>
              <div class = "font-medium">5.6</div>
              <div class = "font-medium">6.7</div>
              <div class = "font-medium">1.8</div>
              <div class = "font-medium">3.9</div>
              <div class = "font-medium">8.4</div>
            </div>
          </div>
        <span class = "catagory">non-living</span>
        <div class = "Box2">
          <span class = "sub-catagory">rock</span>
            <div class = "characters">
              <div class = "font-medium">1.2</div>
              <div class = "font-medium">1.6</div>
              <div class = "font-medium">4.7</div>
              <div class = "font-medium">6.8</div>
              <div class = "font-medium">1.9</div>
              <div class = "font-medium">0.4</div>
            </div>
          <span class = "sub-catagory">stars</span>
            <div class = "characters">
              <div class = "font-medium">3.2</div>
              <div class = "font-medium">5.6</div>
              <div class = "font-medium">2.7</div>
              <div class = "font-medium">4.8</div>
              <div class = "font-medium">1.9</div>
              <div class = "font-medium">2.4</div>
            </div>
          </div>
      </div>
    </div>
    </body>
    </html>

"""

Using Python's BeautifulSoup package, I can get the categories, sub-categories, and characters separately, as shown below:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')
    # The class names are spelled "catagory" / "sub-catagory" in the HTML
    categories = soup.find_all('span', class_='catagory')
    for category in categories:
        print(category.get_text())  # gives Plant, animals, non-living
    sub_categories = soup.find_all('span', class_='sub-catagory')
    for sub_category in sub_categories:
        print(sub_category.get_text())  # gives the sub-categories
    measurements = soup.find_all('div', class_='font-medium')
    for measurement in measurements:
        print(measurement.get_text())  # gives all the font-medium values together

I'm not sure how to get the following result, because the div classes are all the same. Please help.

Plant Trees 1.2 1.6 1.7 1.8 1.9 1.4 Flowers 2.2 3.6 4.7 5.8 6.9 7.4 animals human 7.2 9.6 4.7 3.8 6.9 9.4 dog 4.2 5.6 6.7 1.8 3.9 8.4 non-living rock 1.2 1.6 4.7 6.8 1.9 0.4 stars 3.2 5.6 2.7 4.8 1.9 2.4

To print your text in the expected way, select your Box1 and extract the text with get_text(), setting its separator parameter to \n and strip to True:

print(soup.select_one('.Box1').get_text('\n',strip=True))

Plant
Trees
1.2
1.6
1.7
1.8
1.9
1.4
Flowers
2.2
3.6
4.7
5.8
6.9
7.4
animals
...
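
To make the effect of the two get_text() arguments easier to see, here is a minimal sketch on a reduced fragment. The `snippet` string and the variable names are made up for illustration and are not part of the question's HTML; the call itself is the same one used above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, reduced from the question's markup for illustration.
snippet = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="characters">
    <div class="font-medium">1.2</div>
    <div class="font-medium">1.6</div>
  </div>
</div>
"""

box = BeautifulSoup(snippet, "html.parser").select_one(".Box1")

# Without arguments, the whitespace between the tags is kept as-is.
raw_text = box.get_text()

# With a separator and strip=True, each text node is stripped,
# whitespace-only nodes are dropped, and the rest are joined by the separator.
clean_text = box.get_text("\n", strip=True)

print(repr(raw_text))
print(clean_text)
```

The second call is what removes the indentation and blank lines, so each category, sub-category, and value ends up on its own line.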

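To get more structured output, change the way you select the elements: iterate over the sub-category spans, look backwards for the nearest category, and look forwards for the div that holds the values: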

for e in soup.select('span.sub-catagory'):
    data.append({
        'cat': e.find_previous('span',{'class':'catagory'}).text,
        'subcat': e.text,
        'characters': list(e.find_next('div').stripped_strings)
    })
Example
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

data = []

for e in soup.select('span.sub-catagory'):
    data.append({
        'cat': e.find_previous('span',{'class':'catagory'}).text,
        'subcat': e.text,
        'characters': list(e.find_next('div').stripped_strings)
    })
data
Output
[{'cat': 'Plant',
  'subcat': 'Trees',
  'characters': ['1.2', '1.6', '1.7', '1.8', '1.9', '1.4']},
 {'cat': 'Plant',
  'subcat': 'Flowers',
  'characters': ['2.2', '3.6', '4.7', '5.8', '6.9', '7.4']},
 {'cat': 'animals',
  'subcat': 'human',
  'characters': ['7.2', '9.6', '4.7', '3.8', '6.9', '9.4']},
 {'cat': 'animals',
  'subcat': 'dog',
  'characters': ['4.2', '5.6', '6.7', '1.8', '3.9', '8.4']},
 {'cat': 'non-living',
  'subcat': 'rock',
  'characters': ['1.2', '1.6', '4.7', '6.8', '1.9', '0.4']},
 {'cat': 'non-living',
  'subcat': 'stars',
  'characters': ['3.2', '5.6', '2.7', '4.8', '1.9', '2.4']}]
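
If a nested mapping with numeric values is more convenient than a flat list of records, one possible variation is sketched below. It is only an illustration under a few assumptions: the `html_doc` here is a shortened stand-in for the question's markup, the name `nested` is made up, and it assumes every value can be parsed by float(). It uses find_next_sibling() instead of find_next() so the lookup stays within the same Box2.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's html_doc (an assumption for this sketch).
html_doc = """
<div class="Box1">
  <span class="catagory">Plant</span>
  <div class="Box2">
    <span class="sub-catagory">Trees</span>
    <div class="characters">
      <div class="font-medium">1.2</div>
      <div class="font-medium">1.6</div>
    </div>
    <span class="sub-catagory">Flowers</span>
    <div class="characters">
      <div class="font-medium">2.2</div>
    </div>
  </div>
  <span class="catagory">animals</span>
  <div class="Box2">
    <span class="sub-catagory">dog</span>
    <div class="characters">
      <div class="font-medium">4.2</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

nested = {}  # hypothetical result: {category: {sub-category: [floats]}}
for sub in soup.select("span.sub-catagory"):
    # Nearest preceding category span in document order.
    cat = sub.find_previous("span", class_="catagory").get_text(strip=True)
    # The values live in the "characters" div that follows the span as a sibling.
    values = sub.find_next_sibling("div", class_="characters")
    nested.setdefault(cat, {})[sub.get_text(strip=True)] = [
        float(v.get_text()) for v in values.select("div.font-medium")
    ]

print(nested)
```

With this shape, a lookup such as nested['Plant']['Trees'] returns the list of values for that sub-category directly.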