How to scrape different content with the same HTML attributes and values?

Question

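I am able to scrape a lot of data from a web page, but I am having trouble extracting specific content from subsections that have exactly the same attributes and values. Here is the HTML: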

<li class="highlight">
  Relationship Issues
</li>
<li class="highlight">
  Depression
</li>
<li class="highlight">
  Spirituality
</li>

<li class="">
  ADHD
</li>
<li class="">
  Alcohol Use
</li>
<li class="">
  Anger Management
</li>

Using the HTML as a reference, I have the following:

import requests
from bs4 import BeautifulSoup
import html5lib
import re

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "website.com"


page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, 'html5lib')

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})

for x in specialties:
   Specialty_1 = x.find('li', {'class': 'highlight'}).text
   Specialty_2 = x.find('li', {'class': 'highlight'}).text
   Specialty_3 = x.find('li', {'class': 'highlight'}).text

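So the ideal result would be: Specialty_1 = Relationship Issues; Specialty_2 = Depression; Specialty_3 = Spirituality

and

Issue_1 = ADHD; Issue_2 = Alcohol Use; Issue_3 = Anger Management

Thank you very much for your help!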

Answer 1

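If you know the elements will always sit at the same position in the tree structure, you can simply use XPath. In most cases you can right-click an element in Chrome DevTools to copy its selector or XPath string.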
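BeautifulSoup does not evaluate XPath expressions itself, so the suggestion above needs a library that does. The following is a minimal sketch using lxml (a third-party package), not the answerer's own code. The wrapping `div` and `ul` elements in `html_doc` and the helper name `texts` are assumptions made for illustration; only the `li` elements come from the question.

```python
from lxml import html  # third-party: pip install lxml

# Sample markup: the <li> items are from the question; the wrapping
# <div>/<ul> structure is an assumption made for this sketch.
html_doc = """
<div class="spec-list attributes-top">
  <ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
    <li class="highlight"> Spirituality </li>
  </ul>
  <ul>
    <li class=""> ADHD </li>
    <li class=""> Alcohol Use </li>
    <li class=""> Anger Management </li>
  </ul>
</div>
"""

tree = html.fromstring(html_doc)


def texts(nodes):
    """Return the stripped text of each element (hypothetical helper)."""
    return [node.text_content().strip() for node in nodes]


# The two groups differ only in the value of the class attribute,
# so each XPath filters on that value.
specialties = texts(tree.xpath('.//li[@class="highlight"]'))
issues = texts(tree.xpath('.//li[@class=""]'))

print(specialties)
print(issues)
```

An XPath copied from DevTools is usually an absolute, position-based path, which tends to break when the page layout changes; filtering on an attribute, as here, is generally less brittle.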

Answer 2

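You can build on the dictionary idea from Andrej's answer: use an if/else on whether the class is present to decide the prefix, and extend the select to include the additional section. You need to restart the numbering for the new section, for example with a flag: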

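results = {}
flag = False  # becomes True once the first "issue" item has been seen
counter = 1

# select() returns matches in document order; this assumes the
# specialties appear before the issues on the page
for j in soup.select(".specialties-list li, .attributes-issues li"):
    # .get() avoids a KeyError if an <li> has no class attribute at all,
    # and the membership test does not depend on how class="" is parsed
    if 'highlight' in j.get('class', []):
        results[f'Specialty_{counter}'] = j.text.strip()
    else:
        if not flag:
            # first issue: restart the numbering for the new section
            counter = 1
            flag = True
        results[f'Issue_{counter}'] = j.text.strip()
    counter += 1

print(results)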
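The flag above exists only to restart the numbering when the second section begins. As a variation on that idea (not part of the original answer), keeping one counter per prefix removes the flag and no longer depends on which section comes first. This is a self-contained sketch: the container class names are taken from the selector above, while the surrounding `div`/`ul` markup in `html_doc` is an assumption.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Assumed page structure: the container class names come from the
# selector used in the answer, the rest of the markup is illustrative.
html_doc = """
<div class="specialties-list">
  <ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
    <li class="highlight"> Spirituality </li>
  </ul>
</div>
<div class="attributes-issues">
  <ul>
    <li class=""> ADHD </li>
    <li class=""> Alcohol Use </li>
    <li class=""> Anger Management </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

counts = Counter()  # one running number per prefix
results = {}

for li in soup.select(".specialties-list li, .attributes-issues li"):
    # Highlighted items are specialties; everything else is an issue.
    prefix = "Specialty" if "highlight" in li.get("class", []) else "Issue"
    counts[prefix] += 1
    results[f"{prefix}_{counts[prefix]}"] = li.get_text(strip=True)

print(results)
```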

Answer 3

If you want a variable number of variables, use a dictionary. For example:

from bs4 import BeautifulSoup


html_doc = '''   <li class="highlight">
     Relationship Issues
      </li>
   <li class="highlight">
     Depression
      </li>
   <li class="highlight">
     Spirituality
      </li>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

out = {'Specialty_{}'.format(i): specialty.get_text(strip=True) for i, specialty in enumerate(soup.select("li.highlight"), 1)}

print(out)

Prints:

{'Specialty_1': 'Relationship Issues', 
 'Specialty_2': 'Depression', 
 'Specialty_3': 'Spirituality'}
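The comprehension above works because `select()` returns every matching element, whereas `find()`, which the question calls three times, always returns only the first match. A small sketch of the difference, assuming the three items sit inside a `ul` (the wrapper is not shown in the question):

```python
from bs4 import BeautifulSoup

# The <ul> wrapper is an assumption; the <li> items are from the question.
html_doc = """
<ul>
  <li class="highlight"> Relationship Issues </li>
  <li class="highlight"> Depression </li>
  <li class="highlight"> Spirituality </li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() stops at the first match, so repeating the call repeats the result.
first = soup.find("li", {"class": "highlight"}).get_text(strip=True)
again = soup.find("li", {"class": "highlight"}).get_text(strip=True)

# find_all() returns every match, in document order.
all_items = [
    li.get_text(strip=True)
    for li in soup.find_all("li", {"class": "highlight"})
]

print(first, again)
print(all_items)
```

Unpacking `all_items` into three separate names is possible, but it raises a `ValueError` as soon as a page lists a different number of items, which is why a list or dictionary is the safer container here.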
