xbmc/kodi python: scrape data using BeautifulSoup
I want to edit a Kodi addon that uses re.compile to scrape data, and make it use BeautifulSoup4 instead.
The original code looks like this:
import urllib, urllib2, re, sys, xbmcplugin, xbmcgui

link = read_url(url)
match = re.compile('<a class="frame[^"]*"'
                   ' href="(http://somelink.com/section/[^"]+)" '
                   'title="([^"]+)">.*?<img src="([^"]+)".+?Length:([^<]+)',
                   re.DOTALL).findall(link)
for url, name, thumbnail, length in match:
    addDownLink(name + length, url, 2, thumbnail)
The HTML it is scraping looks like this:
<div id="content">
<span class="someclass">
<span class="sec">
<a class="frame" href="http://somlink.com/section/name-here" title="name here">
<img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
</a>
</span>
<h3 class="title">
<a href="http://somlink.com/section/name-here">name here</a>
</h3>
<span class="details"><span class="length">Length: 99:99</span>
</span>
.
.
.
</div>
How do I get all of the url (href), name, length and thumbnail values using BeautifulSoup4 and pass them to addDownLink(name + length, url, 2, thumbnail)?
from bs4 import BeautifulSoup
html = """<div id="content">
<span class="someclass">
<span class="sec">
<a class="frame" href="http://somlink.com/section/name-here" title="name here">
<img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
</a>
</span>
<h3 class="title">
<a href="http://somlink.com/section/name-here">name here</a>
</h3>
<span class="details"><span class="length">Length: 99:99</span>
</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
sec = soup.find("span", {"class": "someclass"})
# get a tag with frame class
fr = sec.find("a", {"class": "frame"})
# pull img src and href from the a/frame
url, img = fr["href"], fr.find("img")["src"]
# get h3 with title class and extract the text from the anchor
name = sec.select("h3.title a")[0].text
# the length ("size") is in the span with the details class; drop the "Length:" label
size = sec.select("span.details")[0].text.split(None, 1)[-1]
print(url, img, name.strip(), size.strip())
This gives you:
('http://somlink.com/section/name-here', 'http://www.somlink.com/thumb/imgsection/thumbnail.jpg', u'name here', u'99:99')
If you have multiple sections, we just need find_all and apply the same logic to each section:
def secs():
    soup = BeautifulSoup(html, "lxml")
    sections = soup.find_all("span", {"class": "someclass"})
    for sec in sections:
        fr = sec.find("a", {"class": "frame"})
        url, img = fr["href"], fr.find("img")["src"]
        name = sec.select("h3.title a")[0].text.strip()
        # drop the "Length:" label and the surrounding whitespace
        size = sec.select("span.details")[0].text.split(None, 1)[-1].strip()
        yield url, name, img, size
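To tie the generator back to the addon, each yielded tuple can be passed straight to addDownLink. The following is only a sketch under a few assumptions: `add_down_link` is a hypothetical stand-in that records its arguments (the real addDownLink lives in the addon and is not shown in the question), `iter_sections` is an illustrative name, the second section in the sample markup is made up, and the built-in "html.parser" is used in place of "lxml" so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question. The second section is invented
# purely so that the loop has more than one item to process.
HTML = """<div id="content">
<span class="someclass">
<span class="sec">
<a class="frame" href="http://somlink.com/section/name-here" title="name here">
<img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg" >
</a>
</span>
<h3 class="title">
<a href="http://somlink.com/section/name-here">name here</a>
</h3>
<span class="details"><span class="length">Length: 99:99</span>
</span>
</span>
<span class="someclass">
<span class="sec">
<a class="frame" href="http://somlink.com/section/other-name" title="other name">
<img src="http://www.somlink.com/thumb/imgsection/other.jpg" >
</a>
</span>
<h3 class="title">
<a href="http://somlink.com/section/other-name">other name</a>
</h3>
<span class="details"><span class="length">Length: 12:34</span>
</span>
</span>
</div>
"""

calls = []


def add_down_link(label, url, mode, thumbnail):
    """Hypothetical stand-in for the addon's addDownLink; it only records its arguments."""
    calls.append((label, url, mode, thumbnail))


def iter_sections(html):
    """Yield (url, name, thumbnail, length) for every span with class "someclass"."""
    soup = BeautifulSoup(html, "html.parser")
    for sec in soup.find_all("span", class_="someclass"):
        frame = sec.find("a", class_="frame")
        url, thumbnail = frame["href"], frame.find("img")["src"]
        name = sec.select_one("h3.title a").get_text(strip=True)
        # The text is "Length: 99:99"; keep only the part after the label.
        length = sec.select_one("span.details").get_text().split(None, 1)[-1].strip()
        yield url, name, thumbnail, length


for url, name, thumbnail, length in iter_sections(HTML):
    # Same argument order as the original regex-based loop.
    add_down_link(name + " " + length, url, 2, thumbnail)
```

In the addon itself you would pass the page returned by read_url(url) instead of the HTML constant and call the real addDownLink. The explicit space in the label is there because, unlike the regex capture, the stripped length no longer carries its leading space.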
If you don't know all of the classes but you do know that, for example, there is a single img tag, you can call find on the section:
sec.find("img")["src"]
The same logic applies to the rest.
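As a rough illustration of what "the same logic" could look like without any class names, the sketch below finds each piece by tag or by text instead. The function name `extract_without_classes`, the simplified sample markup, and the choice to return None when a piece is missing are all assumptions made for this example, not part of the original addon.

```python
import re

from bs4 import BeautifulSoup

# Simplified markup with no class attributes at all (an assumption for this sketch).
HTML = """<div id="content">
<span>
<a href="http://somlink.com/section/name-here" title="name here">
<img src="http://www.somlink.com/thumb/imgsection/thumbnail.jpg">
</a>
<span>Length: 99:99</span>
</span>
</div>
"""


def extract_without_classes(sec):
    """Return (url, name, thumbnail, length) using only tag names and text, or None."""
    img = sec.find("img")                                  # first <img> in the section
    link = sec.find("a", href=True)                        # first <a> that has an href
    length_text = sec.find(string=re.compile(r"Length:"))  # text node containing the label
    if img is None or link is None or length_text is None:
        return None  # the markup does not have the shape we expected
    # Prefer the title attribute; fall back to the visible link text.
    name = link.get("title") or link.get_text(strip=True)
    length = length_text.split(":", 1)[1].strip()
    return link["href"], name, img["src"], length


soup = BeautifulSoup(HTML, "html.parser")
result = extract_without_classes(soup.find("div", id="content"))
```

Checking for None before indexing avoids a TypeError when a page is missing one of the tags, which is easy to hit when scraping without class names to anchor on.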