Why can't I get track titles from url?

Question

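I'm trying to write a Python script that uses BeautifulSoup to scrape the track titles from this Internet Archive page. I'd like to be able to output:

391106 - Bruce-Partington Plans 400311 - The Retired Colourman ...

But I can't find the tags. Here's my script: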

#!/usr/bin/env python

import getopt, sys
# screen scraping stuff
import urllib2
import re
from bs4 import BeautifulSoup


def usage ( msg ):
        print """
usage: get_titles_sherlockholmes_basil.py

%s
""" % ( msg )
#end usage

def output_html ( url ):

        soup = BeautifulSoup(urllib2.urlopen( url ).read())

        #title = soup.find_all("div", class_="ttl")
        #titles = soup.find_all(class_="ttl")
        #titles = soup.find_all('<div class="ttl">')
        #titles = soup.select("div.ttl")
        #titles = soup.find_all("div", attrs={"class": "ttl"})
        #titles = soup.find_all("div", class_="jwrow")
        #titles = soup.find_all("div", id="jw6_list")
        titles = soup.find_all(id="jw6_list")
        for title in titles:
                print "%s <br>\n" % title
# end output_html

url = 'http://archive.org/details/HQSherlockRathboneTCS'
output_html ( url )
print "<br>-------------------<br>"
sys.exit()

I can't figure out what I'm doing wrong. Any help is appreciated.

Answer 1

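The problem is that the playlist is built in the browser by JavaScript, so the elements you are searching for (such as the div with id "jw6_list") are not in the HTML that urllib2 downloads. The actual track list sits inside a script tag, as a JavaScript array: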

<script type="text/javascript">    

Play('jw6', 
     [{"title":"1. 391106 - Bruce-Partington Plans","image":"/download/HQSherlockRathboneTCS/391106.png","duration":1764,"sources":[{"file":"/download/HQSherlockRathboneTCS/391106.mp3","type":"mp3","height":"0","width":"0"}],"tracks":[{"file":"https://archive.org/stream/HQSherlockRathboneTCS/391106.png&vtt=vtt.vtt","kind":"thumbnails"}]},
{"title":"2. 400311 - The Retired Colourman","image":"/download/HQSherlockRathboneTCS/400311.png","duration":1755,"sources":[{"file":"/download/HQSherlockRathboneTCS/400311.mp3","type":"mp3","height":"0","width":"0"}],"tracks":[{"file":"https://archive.org/stream/HQSherlockRathboneTCS/400311.png&vtt=vtt.vtt","kind":"thumbnails"}]},
...
{"title":"32. 460204 - The Cross of Damascus","image":"/download/HQSherlockRathboneTCS/460204.png","duration":"1720.07","sources":[{"file":"/download/HQSherlockRathboneTCS/460204.mp3","type":"mp3","height":"0","width":"0"}],"tracks":[{"file":"https://archive.org/stream/HQSherlockRathboneTCS/460204.png&vtt=vtt.vtt","kind":"thumbnails"}]}], 
     {"start":0,"embed":null,"so":false,"autoplay":false,"width":0,"height":0,"audio":true,"responsive":true,"expand4wideVideos":false,"flash":false,"startPlaylistIdx":0,"identifier":"HQSherlockRathboneTCS","collection":"oldtimeradio","waveformer":"jw-holder","hide_list":false});

</script>

The idea is to locate the script tag with BeautifulSoup, use a regular expression to extract the list from the script, and load it into a Python list:

from ast import literal_eval
import re
import urllib2

from bs4 import BeautifulSoup

url = 'http://archive.org/details/HQSherlockRathboneTCS'
soup = BeautifulSoup(urllib2.urlopen(url))

script = soup.find('script', text=lambda x: x and 'jw6' in x)
text = script.text.replace('\n', '')

pattern = re.compile(r"Play\('jw6', (.*?),\s+\{\"start")

playlist = literal_eval(pattern.search(text).group(1).strip())
for track in playlist:
    print track['title']

Prints:

1. 391106 - Bruce-Partington Plans
2. 400311 - The Retired Colourman
3. 440515 - Adventure Of The Missing Bloodstain
4. 450326 - The Book of Tobit
5. 450402 - The Amateur Mendicant Society
...
30. 460121 - Telltale Pigeon Feathers
31. 460128 - Sweeney Todd, Demon Barber
32. 460204 - The Cross of Damascus
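
One caveat with the approach above: `literal_eval` only succeeds as long as the array contains no JSON literals such as `null`, `true` or `false`, which are not valid Python. Since the array is JSON, parsing it with the `json` module may be more robust. The following is a minimal Python 3 sketch of that variant, not a drop-in replacement. It assumes the script text has already been obtained (for example via `script.text` as above); `script_text` is a hypothetical, abbreviated stand-in for the real page content, and `extract_titles` is a helper name invented for this illustration.

```python
import json
import re

# Hypothetical, abbreviated stand-in for script.text; the real page
# embeds a longer call with more fields per track.
script_text = '''
Play('jw6',
     [{"title":"1. 391106 - Bruce-Partington Plans","duration":1764,"sources":[{"file":"/download/HQSherlockRathboneTCS/391106.mp3","type":"mp3"}]},
{"title":"2. 400311 - The Retired Colourman","duration":1755,"sources":[{"file":"/download/HQSherlockRathboneTCS/400311.mp3","type":"mp3"}]}],
     {"start":0,"embed":null,"autoplay":false});
'''


def extract_titles(text):
    """Return the track titles found in a Play('jw6', [...], {...}) call."""
    # Find the start of the first argument after Play('jw6', skipping whitespace.
    match = re.search(r"Play\('jw6',\s*", text)
    if match is None:
        return []
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it, so no closing pattern is needed.
    playlist, _end = json.JSONDecoder().raw_decode(text, match.end())
    return [track["title"] for track in playlist]


for title in extract_titles(script_text):
    print(title)
```

Because `raw_decode` stops at the end of the array on its own, the regular expression no longer has to match the `{"start"` options object that follows the playlist.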
