Scraping links from a website using Python / Beautiful Soup for a Kodi addon

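The website I'm trying to scrape media links from (for a Kodi addon) doesn't have much markup such as classes, but each link follows a fairly distinctive layout.

I've created the basic Kodi addon from another working addon, but I'm having trouble getting Python/BeautifulSoup to scrape the links. Other addons rely on markers such as classes, but the site I'm trying to scrape isn't much help in that respect.

I've tried various forums without success; most Kodi addon forums are old and not very active. The guides I've looked at seem to jump from step 1 to step 1000 very quickly, and the examples they provide aren't relevant. I've looked through more than 30 different addons that I thought should help, but I can't work it out.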

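The media links, episode titles, descriptions and images I want to scrape are listed at www.thisiscriminal.com/episodes

The full addon as I have it so far is in a GitHub repository.

I can see them clearly listed in the page source (see code).

I basically just need to be able to parse a website, find the bits below for each episode, populate them as links on a Kodi addon page, and then list the next one underneath. Any help would be greatly appreciated. I've been at this for 3 days straight, and I'm both glad and annoyed that I dropped out of the IT degree course I started in 2002.

The website code I need to extract: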

(episode image)
<img width="300" height="300" ...
https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ../>    

(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>

(episode number)
<h4>Episode #115</h4>

(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"

(episode description)
</header>When Cecilia....</article>

Code

import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []

    for content in soup.find_all('a'):

        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link

            title = content.find('<h4>Episode ')
            title = title.get_text()

            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()


            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue


        item = {
                'url': link,
                'title': title,
                'desc': desc,
                'thumbnail': thumbnail
        }

        #needto check that item is not null here
        subjects.append(item)

    return subjects

2019-06-09 00:05:35.719 T:1916360240 ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576 ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
 - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type:
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44, in
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280 ERROR: GetDirectory - Error getting plugin://plugin.audio.abcradionational/
2019-06-09 00:05:41.636 T:1916360240 ERROR: CGUIMediaWindow::GetDirectory(plugin://plugin.audio.abcradionational/) failed

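As Jack pointed out, the page response includes JavaScript code that makes AJAX calls. That code is part of the page response, but it is not executed by requests.

While there are libraries that can render this content for you, I'd suggest an alternative approach.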

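Navigate to the page in any browser (Chrome is used here) and press F12 to open the developer tools.

We're interested in the Network tab, filtered to XHR. With the developer tools open, press Ctrl + R to reload the page and record the XHR requests.

You should see a list of recorded requests, and you can inspect each one. I think the /episodes endpoint is the one you'll be interested in.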

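This is a structured response, more specifically a JSON response. To make use of this endpoint, you just need to issue the same GET request using requests.

This can be done simply by:

  1. Right-clicking the request
  2. Selecting Copy -> Copy as cURL (select cURL (Bash) if given the choice)
  3. Pasting it into cURL Converter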
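As a rough sketch of what "issue the same GET request" can look like once the URL is known: the block below uses the standard library's urllib rather than requests, purely so that it has no third-party dependency. The URL is the one that appears in the code later in this thread; the helper names `decode_body` and `fetch_json` and the browser-like User-Agent header are assumptions made for illustration, not something the site is known to require.

```python
import json
from urllib.request import Request, urlopen

# URL as it appears in the code later in this thread.
EPISODES_URL = "https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1"


def decode_body(raw_bytes):
    """Turn the raw response bytes into Python objects (dicts and lists)."""
    return json.loads(raw_bytes.decode("utf-8"))


def fetch_json(url=EPISODES_URL, timeout=10):
    """Issue a plain GET request, as the browser does, and decode the JSON body."""
    # Assumption: a browser-like User-Agent; some servers reject urllib's default.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return decode_body(response.read())
```

Calling `fetch_json()` needs network access and returns the decoded response; the episodes should then be available under its `posts` key.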

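The good news is that the page loads its content from a wp-json source, and you can issue a simple XHR-style request against it. The other answers seem to cover how to find it pretty well.

You can then parse the information out of the JSON as needed. The text description is HTML within the returned JSON, so you can pass it to bs4 and parse it as required. Example below. To explore the JSON object for the Cecilia episode, paste the following into a JSON viewer: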

{'title': 'Cecilia', 'excerpt': {'short': 'When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another...', 'long': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get  off your...", 'full': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get  off your first purchase..."}, 'content': '<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don&#8217;t.”</p>\n<p data-pm-context="[]">Sponsors:</p>\n<p><strong>Article</strong> Visit <a href="http://article.com/criminal">article.com/criminal </a>to get  off your first purchase of 0 or more.</p>\n<p><a href="https://www.therealreal.com/"><strong>The Real Real</strong></a> Shop in-store, online, or download the app, and get 20% off select items with the promo code REAL.</p>\n<p><strong>Simplisafe</strong> Protect your home today and get free shipping at <a href="http://SimpliSafe.com/CRIMINAL">SimpliSafe.com/CRIMINAL</a></p>\n<p><strong>Squarespace</strong> Try <a href="http://Squarespace.com/criminal">Squarespace.com/criminal </a>for a free trial and when you’re ready to launch, use the offer code INVISIBLE to save 10% off your first purchase of a website or domain.</p>\n<p><strong>Sun Basket</strong> Go to <a href="http://sunbasket.com/criminal">sunbasket.com/criminal </a>to get up to  off today!</p>\n', 'image': {'thumb': 
'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-150x150.png', 'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png', 'large': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-1024x1024.png', 'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}, 'episodeNumber': '115', 'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3', 'musicCredits':"FALSE", 'id': 3129, 'slug': 'episode-115-cecilia-5-24-19', 'date': '2019-05-24 19:43:44', 'permalink': 'https://thisiscriminal.com/episode-115-cecilia-5-24-19/', 'next':"None", 'prev': {'slug': 'episode-114-philip-and-becky', 'title': 'Episode 114: Philip and Becky (5.10.2019)'}}
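The fields in that object line up fairly directly with the `url` / `title` / `desc` / `thumbnail` dictionary built in the question. A minimal sketch of that mapping follows; the helper name `post_to_item`, the title format, and the choice of the `short` excerpt and `medium` image are illustrative assumptions. The sample is trimmed to the fields used, with values copied from the object above.

```python
def post_to_item(post):
    """Map one episode object from the JSON feed onto the addon's item dict."""
    return {
        "url": post["audioSource"],
        "title": "Episode #{}: {}".format(post["episodeNumber"], post["title"]),
        # 'excerpt' is already plain text ('long' and 'full' also exist),
        # so no HTML parsing is needed for this field.
        "desc": post["excerpt"]["short"],
        "thumbnail": post["image"]["medium"],
    }


# Trimmed copy of the Cecilia object shown above.
sample_post = {
    "title": "Cecilia",
    "excerpt": {
        "short": "When Cecilia Gentili was growing up in Argentina, she felt so "
                 "different from everyone around her that she thought she might "
                 "be from another...",
    },
    "image": {
        "medium": "https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png",
    },
    "episodeNumber": "115",
    "audioSource": "https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/"
                   "a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3",
}

item = post_to_item(sample_post)
```

Using the plain-text excerpt avoids an HTML parser altogether; if you want the full description instead, the `content` field has it as HTML.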

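The request is a query-string URL, so you can change the number of items to return. The response also lists the total number of pages, so you know how many requests are needed to return everything.

If you look at the query string:

posts=1000&page=1

you can see the two parameters that can be changed accordingly.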
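If you would rather request smaller pages than ask for 1000 posts at once, the two parameters can drive a simple loop. The sketch below is only that: `iter_all_posts` and `fetch_page` are hypothetical names, and `pages_key="pages"` is a guess at the name of the total-page field, which is not shown in this thread, so check the real response and pass the key you find there.

```python
def iter_all_posts(fetch_page, posts_per_page=100, pages_key="pages"):
    """Yield every post, requesting one page at a time.

    fetch_page(posts, page) must return the decoded JSON for that page.
    pages_key is a HYPOTHETICAL name for the total-page count field.
    """
    page = 1
    while True:
        data = fetch_page(posts_per_page, page)
        for post in data.get("posts", []):
            yield post
        # If the key is missing, fall back to 1 so the loop stops after one page.
        total_pages = int(data.get(pages_key, 1))
        if page >= total_pages:
            break
        page += 1
```

With requests, `fetch_page` could be something like `lambda posts, page: requests.get(base_url, params={"posts": posts, "page": page}).json()`, where `base_url` is the endpoint without its query string.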

import requests
from bs4 import BeautifulSoup as bs

# Ask for up to 1000 episodes in a single page of results
r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1').json()

for post in r['posts']:
    title = post['title']
    # 'content' is HTML, so parse it before pulling out text
    soup = bs(post['content'], 'html.parser')
    desc = soup.select_one('p').text  # or: soup.get_text().replace(u'\xa0', u' ').replace(u'\n', u' ')
    img = post['image']['full']
    episode_link = post['audioSource']  # sure this is what you wanted?
    episode_number = post['episodeNumber']
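On the UnicodeDecodeError in the log from the question: it is raised by the `.replace('\xa0', ' ')` call on line 44 of addon.py. Assuming that Kodi install runs Python 2 (which the print statements and the 'ascii' codec message suggest), `get_text()` returns a unicode string while `'\xa0'` is a byte string, so Python tries to decode that byte as ASCII and fails. Writing the literals as unicode avoids the implicit decode. A small sketch, with `clean_text` as a hypothetical helper name:

```python
def clean_text(text):
    """Replace non-breaking spaces and newlines with plain spaces."""
    # The u prefix keeps these literals unicode on Python 2 as well,
    # so no implicit ASCII decoding of the 0xa0 byte takes place.
    return text.replace(u"\xa0", u" ").replace(u"\n", u" ").strip()
```

The same function works unchanged on Python 3, where the `u` prefix is accepted but has no effect.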