How to scrape text of a span sibling span?
Hello, I'm trying to learn web scraping, so I started by trying to scrape my school's dining menu.
I've run into a problem: I can't get the menu items under the span class; instead, on the same line, I only get the word that the span with that class contains ("Show").
Here is a short excerpt of what I am trying to work with:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='chromedriver.exe')  # changed this
driver.get('https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/')

results = []
content = driver.page_source
soups = BeautifulSoup(content, 'html.parser')
element = soups.findAll('span', class_='collapsible-heading-status')
for span in element:
    print(span.text)
I already tried changing it to span.span.text, but that doesn't return anything, so could someone give me some pointers on how to extract the information under the collapsible-heading-status class?
Tasty waffles - as mentioned, they are already gone, but to achieve your goal, one approach is to use an adjacent sibling combinator.

Select the names via css selectors:
for e in soup.select('.collapsible-heading-status + span'):
    print(e.text)
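A minimal, self-contained sketch of how the adjacent sibling combinator behaves. The class name is from the question, but the HTML itself is invented here for illustration, mimicking a heading where the "Show" status span is immediately followed by a sibling span holding the item name:

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the menu page: the status span ("Show")
# is immediately followed by a sibling span holding the item name.
html = """
<h5><span class="collapsible-heading-status">Show </span><span>Housemade Granola</span></h5>
<h5><span class="collapsible-heading-status">Show </span><span>Vanilla Chia Seed Pudding</span></h5>
"""
soup = BeautifulSoup(html, 'html.parser')

# '.collapsible-heading-status + span' matches the span that directly
# follows each status span, i.e. the one with the menu item's name.
names = [e.text for e in soup.select('.collapsible-heading-status + span')]
print(names)  # ['Housemade Granola', 'Vanilla Chia Seed Pudding']
```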
Or with find_next_sibling():
for e in soup.find_all('span', class_='collapsible-heading-status'):
    print(e.find_next_sibling('span').text)
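A quick sketch, on invented markup, of why the original span.span.text attempt returned nothing: the name span is a sibling of the status span, not a child nested inside it, so find_next_sibling() is the right walk:

```python
from bs4 import BeautifulSoup

# Invented markup: the name span is a *sibling* of the status span,
# not nested inside it, which is why span.span.text found nothing.
html = """
<h5><span class="collapsible-heading-status">Show </span><span>Housemade Granola</span></h5>
"""
soup = BeautifulSoup(html, 'html.parser')

status = soup.find('span', class_='collapsible-heading-status')
print(status.span)                            # None - no span *inside* the status span
print(status.find_next_sibling('span').text)  # Housemade Granola
```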
Example

To get the complete information for each item in a structured way, you could use:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/")
soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []
for e in soup.select('.nutrition'):
    d = {
        'meal': e.find_previous('h4').text,
        'title': e.find_previous('h5').text,
        'name': e.find_previous('span').text,
        'description': e.p.text
    }
    d.update({n.text: n.find_next().text.strip(': ') for n in e.select('h6')})
    data.append(d)
data
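The dict comprehension over e.select('h6') pairs each h6 label with the text of the element that follows it, stripping the leading ": " from the value. A sketch of that pattern on invented markup mimicking one .nutrition block (the real page's structure may differ):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking one .nutrition block: each <h6> label is
# followed by the element holding its value, prefixed with ': '.
html = """
<div class="nutrition">
  <p>Crunchy and sweet granola</p>
  <h6>Calories</h6><span>: 360.18</span>
  <h6>Allergens</h6><span>: Gluten/Wheat</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
e = soup.select_one('.nutrition')

# n.find_next() is the element right after each label; strip(': ')
# removes the leading colon and space from the value text.
info = {n.text: n.find_next().text.strip(': ') for n in e.select('h6')}
print(info)  # {'Calories': '360.18', 'Allergens': 'Gluten/Wheat'}
```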
Output
[{'meal': 'Breakfast',
'title': 'Fresh Inspirations',
'name': 'Vanilla Chia Seed Pudding with Blueberrries',
'description': 'Vanilla chia seed pudding with blueberries, shredded coconut, and toasted almonds',
'Serving Size': '1 serving',
'Calories': '392.93',
'Fat (g)': '36.34',
'Carbohydrates (g)': '17.91',
'Protein (g)': '4.59',
'Allergens': 'Tree Nuts/Coconut',
'Ingredients': 'Coconut milk, chia seeds, beet sugar, imitation vanilla (water, vanillin, caramel color, propylene glycol, ethyl vanillin, potassium sorbate), blueberries, shredded sweetened coconut (desiccated coconut processed with sugar, water, propylene glycol, salt, sodium metabisulfite), blanched slivered almonds'},
{'meal': 'Breakfast',
'title': 'Fresh Inspirations',
'name': 'Housemade Granola',
'description': 'Crunchy and sweet granola made with mixed nuts and old fashioned rolled oats',
'Serving Size': '1/2 cup',
'Calories': '360.18',
'Fat (g)': '17.33',
'Carbohydrates (g)': '47.13',
'Protein (g)': '8.03',
'Allergens': 'Gluten/Wheat/Dairy/Peanuts/Tree Nuts',
'Ingredients': 'Old fashioned rolled oats (per manufacturer, may contain wheat/gluten), sunflower seeds, seedless raisins, unsalted butter, pure clover honey, peanut-free mixed nuts (cashews, almonds, sunflower oil and/or cottonseed oil, pecans, hazelnuts, dried Brazil nuts, salt), light brown beet sugar, molasses'},...]