How to scrape different content with the same HTML attributes and values?
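I am able to scrape a lot of data from a web page, but I am struggling to extract specific content from subsections that have exactly the same attributes and values. Here is the HTML: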
<li class="highlight">
Relationship Issues
</li>
<li class="highlight">
Depression
</li>
<li class="highlight">
Spirituality
</li>
<li class="">
ADHD
</li>
<li class="">
Alcohol Use
</li>
<li class="">
Anger Management
</li>
Using the HTML as a reference, I have the following:
import requests
from bs4 import BeautifulSoup
import html5lib
import re
headers = {'User-Agent': 'Mozilla/5.0'}
URL = "website.com"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})
for x in specialties:
    Specialty_1 = x.find('li', {'class': 'highlight'}).text
    Specialty_2 = x.find('li', {'class': 'highlight'}).text
    Specialty_3 = x.find('li', {'class': 'highlight'}).text
So the ideal outcome is: Specialty_1 = Relationship Issues; Specialty_2 = Depression; Specialty_3 = Spirituality
and
Issue_1 = ADHD; Issue_2 = Alcohol Use; Issue_3 = Anger Management
Any help is greatly appreciated!
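If you know the elements will always sit at the same position in the tree, you can simply use XPath. In most cases, you can right-click an element in Chrome DevTools to copy its selector or XPath string.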
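To illustrate the XPath idea, here is a minimal sketch that runs against the sample markup from the question. It assumes the fragment is wrapped in a `<ul>` and is well-formed, which lets the standard library's `xml.etree.ElementTree` (it supports only a subset of XPath) parse it. A real page would normally need a lenient HTML parser such as lxml, and the expressions would have to match the page's actual structure.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <ul> (an assumption) so that
# it forms a single well-formed tree.
html_doc = """<ul>
<li class="highlight">
Relationship Issues
</li>
<li class="highlight">
Depression
</li>
<li class="highlight">
Spirituality
</li>
<li class="">
ADHD
</li>
<li class="">
Alcohol Use
</li>
<li class="">
Anger Management
</li>
</ul>"""

root = ET.fromstring(html_doc)

# Select the <li> elements by the value of their class attribute.
specialties = [li.text.strip() for li in root.findall(".//li[@class='highlight']")]
issues = [li.text.strip() for li in root.findall(".//li[@class='']")]

print(specialties)
print(issues)
```

An XPath copied from DevTools is usually an absolute path to one element, so it tends to need generalising (as with the attribute predicate here) before it matches a whole group.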
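You can build on Andrej's dictionary idea: use an if/else on whether a class is present to choose the prefix, and extend the select to include the additional section. You need to reset the numbering for the new section, e.g. with a flag: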
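results = {}
flag = False  # becomes True once the first non-highlighted item is seen
counter = 1

for j in soup.select(".specialties-list li, .attributes-issues li"):
    if j.get('class'):
        # a non-empty class list (e.g. "highlight") marks a specialty
        results[f'Specialty_{counter}'] = j.text.strip()
    else:
        if not flag:
            # first issue: restart the numbering for the new section
            counter = 1
            flag = True
        results[f'Issue_{counter}'] = j.text.strip()
    counter += 1

print(results)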
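For a quick check without the live page, the same outcome can be sketched against the question's sample HTML. This is a variation rather than the code above: it numbers each group separately with two selectors instead of a flag. The wrapping `<ul>` and the helper name `number_items` are assumptions made for illustration, and `:not()` relies on the CSS selector support that ships with recent Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, wrapped in a <ul> (an assumption).
html_doc = """
<ul>
<li class="highlight">Relationship Issues</li>
<li class="highlight">Depression</li>
<li class="highlight">Spirituality</li>
<li class="">ADHD</li>
<li class="">Alcohol Use</li>
<li class="">Anger Management</li>
</ul>
"""


def number_items(items, prefix):
    """Hypothetical helper: map 'Prefix_1', 'Prefix_2', ... to each tag's text."""
    return {
        f"{prefix}_{i}": item.get_text(strip=True)
        for i, item in enumerate(items, 1)
    }


soup = BeautifulSoup(html_doc, "html.parser")

results = {}
# Highlighted items are the specialties; every other <li> is an issue.
results.update(number_items(soup.select("li.highlight"), "Specialty"))
results.update(number_items(soup.select("li:not(.highlight)"), "Issue"))

print(results)
```

Selecting each group on its own means no counter has to be reset, at the cost of walking the document twice.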
If you want a variable number of variables, use a dictionary. For example:
from bs4 import BeautifulSoup
html_doc = ''' <li class="highlight">
Relationship Issues
</li>
<li class="highlight">
Depression
</li>
<li class="highlight">
Spirituality
</li>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
out = {'Specialty_{}'.format(i): specialty.get_text(strip=True) for i, specialty in enumerate(soup.select("li.highlight"), 1)}
print(out)
Prints:
{'Specialty_1': 'Relationship Issues',
'Specialty_2': 'Depression',
'Specialty_3': 'Spirituality'}
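As for why the loop in the question yields the same value three times: `find` returns only the first matching tag each time it is called, whereas `find_all` returns every match. A minimal sketch using the question's sample markup (the names `first_only` and `all_matches` are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = """
<li class="highlight">Relationship Issues</li>
<li class="highlight">Depression</li>
<li class="highlight">Spirituality</li>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Calling find() repeatedly always starts from the top and returns the first match.
first_only = [
    soup.find("li", {"class": "highlight"}).get_text(strip=True)
    for _ in range(3)
]

# find_all() returns every matching tag, in document order.
all_matches = [
    li.get_text(strip=True)
    for li in soup.find_all("li", {"class": "highlight"})
]

print(first_only)
print(all_matches)
```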