Extracting links and titles only
I am trying to extract the links and titles of the episode links on an anime site, but I can only extract the whole tag; I want just the href and the title.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")

for link in soup.find_all('div', class_='list_episode'):
    href = link.get('href')
    print(href)
Here is the site's HTML:
<a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25" title="Phi Brain: Kami no Puzzle 3 episode 25">
Phi Brain: Kami no Puzzle 3 episode 25 <span> 26-03-2014</span>
</a>
And here is the output:
C:\Python34\python.exe C:/Users/M.Murad/PycharmProjects/untitled/Webcrawler.py
None
Process finished with exit code 0
What I want is every link and title inside that class (the episodes and their links).
Thanks.
The whole page has only one element with the class 'list_episode', so you can find it, filter for the 'a' tags inside it, and then get the value of the 'href' attribute:
In [127]: import requests
...: from bs4 import BeautifulSoup
...:
...: r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
...: soup = BeautifulSoup(r.content, "html.parser")
...:
In [128]: [x.get('href') for x in soup.find('div', class_='list_episode').find_all('a')]
Out[128]:
[u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-25',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-24',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-23',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-22',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-21',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-20',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-19',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-18',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-17',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-16',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-15',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-14',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-13',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-12',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-11',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-10',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-9',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-8',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-7',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-6',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-5',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-4',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-3',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-2',
u'http://animeonline.vip/phi-brain-kami-puzzle-3-episode-1']
So what is happening is that your `link` variable holds the entire <div> with class="list_episode" and everything inside it; that div contains many anchor tags, each of which carries the link in its "href" attribute and the title in its "title" attribute.
With a small change to your code you will get what you want:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://animeonline.vip/info/phi-brain-kami-puzzle-3')
soup = BeautifulSoup(r.content, "html.parser")

for link in soup.find_all('div', class_='list_episode'):
    href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]
    print(href_and_title)
The output will be in the format [(href, title), (href, title), ..., (href, title)].
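To show the extraction itself without depending on the live site, here is a minimal sketch that runs the same `(href, title)` list comprehension against a small inline HTML snippet modeled on the page's episode list (the snippet and its two entries are invented for illustration):

```python
from bs4 import BeautifulSoup

# A small stand-in for the site's episode list; no network request
# is needed to demonstrate the extraction.
html = """
<div class="list_episode">
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-2"
     title="Phi Brain: Kami no Puzzle 3 episode 2">episode 2 <span>26-03-2014</span></a>
  <a href="http://animeonline.vip/phi-brain-kami-puzzle-3-episode-1"
     title="Phi Brain: Kami no Puzzle 3 episode 1">episode 1 <span>19-03-2014</span></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
episode_div = soup.find("div", class_="list_episode")

# Each <a> carries the link in "href" and the title in "title".
href_and_title = [(a.get("href"), a.get("title"))
                  for a in episode_div.find_all("a")]
print(href_and_title)
```

Running this prints the two (href, title) pairs, confirming that the pairs come from the anchors' attributes rather than from the div itself.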
Edit (explanation):
So what happens is that when you do
soup.find_all('div', class_='list_episode')
it gives you every <div> with class "list_episode" in the HTML page (here there is only one), but that div contains many anchor tags, each with its own "href" and "title" values. That is why we loop over the anchors inside the div (there can be more than one <a>) and call .get() on each:
href_and_title = [(a.get("href"), a.get("title")) for a in link.find_all("a")]
I hope this is clearer.
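If installing BeautifulSoup is not an option, the same idea can be sketched with only the standard library's `html.parser` module: track when parsing is inside the `list_episode` div and record each anchor's `href` and `title`. This is an illustrative alternative, not part of the original answer, and it assumes the episode div contains no nested <div> tags:

```python
from html.parser import HTMLParser


class EpisodeLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags that appear
    inside a <div class="list_episode"> (assumes no nested divs)."""

    def __init__(self):
        super().__init__()
        self.inside_list = False
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "list_episode":
            self.inside_list = True
        elif tag == "a" and self.inside_list:
            self.pairs.append((attrs.get("href"), attrs.get("title")))

    def handle_endtag(self, tag):
        # Closing the div ends the episode list (no nesting assumed).
        if tag == "div":
            self.inside_list = False


parser = EpisodeLinkParser()
parser.feed('<div class="list_episode">'
            '<a href="http://example.com/ep-1" title="episode 1">episode 1</a>'
            '</div>')
print(parser.pairs)  # → [('http://example.com/ep-1', 'episode 1')]
```

The `http://example.com/ep-1` URL is a placeholder; in practice you would feed the parser the HTML fetched with `requests` as in the answer above.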