Extracting text from an HTML file using Python (music artist / title)
I want to extract the artist and song title from a page.
Page:
http://www.swr3.de//-/id=47424/cf=42/did=65794/93avs/index.html?hour=5&date=2015-10-23
<div class="detail-body">
<h4 class="detail-heading" itemprop="name">No son of mine</h4>
<span itemprop="byArtist" itemscope="" itemtype="http://schema.org/MusicGroup"><link href="http://www.swr3.de/musik/poplexikon/-/id=927882/did=70326/i3zglz/index.html" itemprop="url">
<h5 itemprop="name">Genesis</h5>
This block repeats several times on the page (see the swr3.de link above), but I don't know how to use BeautifulSoup and Python to build a list like:
Genesis - No son of mine
Double You - Please don't go
Use a combination of BeautifulSoup, requests, and lxml.
First, install the prerequisites:
pip install beautifulsoup4
pip install requests
pip install lxml
swr3.py:
import requests
from bs4 import BeautifulSoup

parsedsongs = []
result = requests.get('http://www.swr3.de//-/id=47424/cf=42/did=65794/93avs/index.html?hour=5&date=2015-10-23')
# Parse the raw response bytes with the lxml parser (lxml must be installed).
soup = BeautifulSoup(result.content, "lxml")
# Each song sits in its own <div class="detail-body"> block.
detailbodys = soup.find_all('div', 'detail-body')
for detailbody in detailbodys:
    title = detailbody.h4.string.strip()
    # Use the <h5> for the artist when present; otherwise fall back to the <span> text.
    if detailbody.h5:
        artist = detailbody.h5.string.strip()
    else:
        artist = detailbody.span.string.strip()
    parsedsongs.append({'artist': artist, 'title': title})

for entry in parsedsongs:
    print('Artist: {}\tTitle: {}'.format(entry['artist'], entry['title']))
Output:
(swr3)macbook:swr3 joeyoung$ python swr3.py
Artist: Vaya Con Dios Title: Nah neh nah
Artist: Genesis Title: No son of mine
Artist: Genesis Title: No son of mine
Artist: Double You Title: Please don't go
Artist: Stereo MC's Title: Step it up
Artist: Cranberries Title: Zombie
Artist: La Bouche Title: Sweet dreams
Artist: Die Prinzen Title: Du mußt ein Schwein sein
Artist: Bad Religion Title: Punk rock song
Artist: Bellini Title: Samba de Janeiro
Artist: Dion, Celine; Bee Gees Title: Immortality
Artist: Jones, Tom; Mousse T. Title: Sex bomb
Artist: Yanai, Kate Title: Bacardi feeling (Summer dreamin')
Artist: Heroes Del Silencio Title: Entre dos tierras
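The question asked for lines of the form "Artist - Title". As a rough sketch of that formatting step, the snippet below parses a small inline sample instead of fetching the live page. The sample markup is a simplified, hypothetical copy of the snippet quoted in the question, format_songs is an illustrative helper name, and skipping repeated entries is an assumption about what is wanted (Genesis shows up twice in the output above). It uses Python's built-in html.parser backend, so it should run without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the markup quoted in the question (simplified).
SAMPLE_HTML = """
<div class="detail-body">
  <h4 class="detail-heading" itemprop="name">No son of mine</h4>
  <span itemprop="byArtist" itemscope="" itemtype="http://schema.org/MusicGroup">
    <h5 itemprop="name">Genesis</h5>
  </span>
</div>
<div class="detail-body">
  <h4 class="detail-heading" itemprop="name">No son of mine</h4>
  <span itemprop="byArtist" itemscope="" itemtype="http://schema.org/MusicGroup">
    <h5 itemprop="name">Genesis</h5>
  </span>
</div>
<div class="detail-body">
  <h4 class="detail-heading" itemprop="name">Please don't go</h4>
  <span itemprop="byArtist">Double You</span>
</div>
"""


def format_songs(html):
    """Return 'Artist - Title' strings in page order, skipping exact repeats."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for body in soup.find_all("div", class_="detail-body"):
        title_tag = body.find("h4")
        # Prefer the <h5> artist name; fall back to the <span> text.
        artist_tag = body.find("h5") or body.find("span")
        if title_tag is None or artist_tag is None:
            continue  # skip blocks that lack a title or an artist
        line = "{} - {}".format(
            artist_tag.get_text(strip=True), title_tag.get_text(strip=True)
        )
        if line not in lines:  # assumption: repeated plays are not wanted
            lines.append(line)
    return lines


if __name__ == "__main__":
    for line in format_songs(SAMPLE_HTML):
        print(line)
```

To run it against the real page, pass result.content from the requests call in swr3.py to format_songs instead of the sample string.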