Excluding scraped results that aren't of a specific format using string operations or regex
Hi, I'm working on a program that scrapes songs from a website and puts them in a list. This is my code so far:
from bs4 import BeautifulSoup
import urllib2
from collections import namedtuple
url = 'http://www.xpn.org/playlists/xpn-playlist'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
songs = []
Song = namedtuple("Song", "artist name album")
for link in soup.find_all("li", class_="song"):
    song = Song._make(link.text.strip()[12:].split(" - "))
    songs.append(song)
for song in songs:
    print(song.artist, song.name, song.album)
This works fine when a result has this format:
<li class="song"> <a href="/default.htm" onclick="return clickreturnvalue()" onmouseover="dropdownmenu(this, event, menu1, '100px','Jason & The Scorchers','I Really Don\'t Want To Know','Lost & Found')" onmouseout="delayhidemenu()">Buy</a> Jason & The Scorchers - I Really Don't Want To Know - Lost & Found</li>
But it doesn't work when a result has this format:
<li class="song">|World Cafe| - Thursday 10-22-2015 Hour 2, Part 7 - Host: David Dye</li>
I get an error because splitting on " - " yields only two fields:
TypeError Traceback (most recent call last)
<ipython-input-28-1a0a99934b5c> in <module>()
12 Song = namedtuple("Song", "artist name album")
13 for link in soup.find_all("li", class_="song"):
---> 14 song = Song._make(link.text.strip()[12:].split(" - "))
15 songs.append(song)
16
<string> in _make(cls, iterable, new, len)
TypeError: Expected 3 arguments, got 2
How can I modify it to exclude any results that aren't in the correct format?
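You should check whether you actually got all three arguments you want. If you only want to exclude results that don't split into three arguments, just use a try/except TypeError block.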
def threeArgs(one, two, three):
    pass  # function stuff

try:
    threeArgs(1, 2)  # one argument short, so this raises TypeError
except TypeError:
    print("Skipping...")
This won't raise an error, and if you put it inside the for loop, it will skip the bad item and continue.
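Applied to the scraping loop from the question, the same idea might look like the sketch below. The `raw_items` list is a hypothetical stand-in for the stripped text of each `li` element (so the example runs without network access), and `parse_songs` is an illustrative helper name that does not appear in the original code.

```python
from collections import namedtuple

Song = namedtuple("Song", "artist name album")

# Hypothetical stand-ins for the text of each <li class="song"> element.
raw_items = [
    "Jason & The Scorchers - I Really Don't Want To Know - Lost & Found",
    "Thursday 10-22-2015 Hour 2, Part 7 - Host: David Dye",
]


def parse_songs(items):
    """Build Song tuples, skipping entries that don't split into three fields."""
    songs = []
    for text in items:
        try:
            # _make raises TypeError when the number of fields is not 3.
            songs.append(Song._make(text.split(" - ")))
        except TypeError:
            print("Skipping: " + text)
    return songs


songs = parse_songs(raw_items)
for song in songs:
    print(song.artist, song.name, song.album)
```

Because `_make` raises `TypeError` for any field count other than three, this also skips entries that split into four or more pieces.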
Why use try with the namedtuple? The code below works for me.
from bs4 import BeautifulSoup
import requests
from collections import namedtuple
url = 'http://www.xpn.org/playlists/xpn-playlist'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
songs = []
Song = namedtuple("Song", ["artist", "name", "album"])
for link in soup.find_all("li", class_="song"):
    # Drop the "Buy" link text, split into fields, and trim each field.
    data = [part.strip() for part in link.text.replace('Buy', '').strip().split(" - ")]
    if len(data) == 3:
        song = Song._make(data)
        songs.append(song)
    else:
        # Fewer or more than three fields: not a song entry, so skip it.
        print("Skipping: expected 3 items, found %d" % len(data))
for song in songs:
    print(song.artist, song.name, song.album)
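Since the question also mentions regex, here is a minimal sketch of that route, written for Python 3. It assumes a song entry is exactly three fields separated by " - ", and that show headers start with "|" as in the question's example. The pattern, the `parse_song` helper, and the sample strings are illustrative assumptions, not part of the answer above.

```python
import re
from collections import namedtuple

Song = namedtuple("Song", ["artist", "name", "album"])

# One field: one or more characters that never run into the " - " separator.
FIELD = r"((?:(?! - ).)+)"
# Assumption: a song line is exactly three fields and does not start with "|"
# (the show header in the question does).
SONG_RE = re.compile(r"(?!\|){0} - {0} - {0}".format(FIELD))


def parse_song(text):
    """Return a Song for well-formed text, or None if the format differs."""
    match = SONG_RE.fullmatch(text.strip())
    if match is None:
        return None
    return Song._make(part.strip() for part in match.groups())


samples = [
    "Jason & The Scorchers - I Really Don't Want To Know - Lost & Found",
    "|World Cafe| - Thursday 10-22-2015 Hour 2, Part 7 - Host: David Dye",
    "Thursday 10-22-2015 Hour 2, Part 7 - Host: David Dye",
]
songs = [song for song in map(parse_song, samples) if song is not None]
for song in songs:
    print(song.artist, song.name, song.album)
```

A plain length check cannot tell a three-field header from a song, which is where a pattern like this may help; the "|" rule is only a guess based on the single header shown in the question.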