Python

Question

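I'm trying to create a web crawler for Reddit's /r/all that collects the links of the top posts. I have been following part one of thenewboston's web crawler tutorial series on YouTube.

In my code, I removed the while loop that, in thenewboston's case, limits the number of pages to crawl (I only intend to crawl the top 25 posts of /r/all, which is a single page). I made these changes to suit the purpose of my crawler.

I also changed the URL variable to 'http://www.reddit.com/r/all/' (for obvious reasons) and the Soup.findAll iterable to Soup.findAll('a', {'class': 'title may-blank loggedin'}) (title may-blank loggedin is the class of a post title on Reddit).

Here is my code: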

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all/'
    sourceCode = requests.get(URL)
    plainText = sourceCode.text
    Soup = BeautifulSoup(plainText)
    for link in Soup.findAll('a', {'class': 'title may-blank loggedin'}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()

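I did some amateur bug-checking with print statements between each line, and it seems the for loop body is never executed.

To follow along or compare thenewboston's code with mine, skip to part two of his mini-series and find the point in the video where his code is shown.

Edit: thenewboston's code, as requested: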

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://buckysroom.org/trade/search.php?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = 'http://buckysroom.org' + link.get('href')
            print(href)
        page += 1

trade_spider(1)  # max_pages is a required argument; 1 crawls a single page

Answer 1

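This isn't a direct answer to your question, but I wanted to let you know that there is a Reddit API wrapper for Python called PRAW (Python Reddit API Wrapper). You might want to take a look at it, since it makes what you're trying to do much easier.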

Link: https://praw.readthedocs.org/en/v2.1.20/

Answer 2

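First, thenewboston's tutorial appears to be a screencast, so putting the code in the question would help.

Second, I'd recommend writing the page out to a local file so you can open it in a browser and poke around with the web developer tools to find what you want. I'd also recommend using ipython to experiment with BeautifulSoup on the local file rather than fetching the page every time.

If you add this to your code, you can do that: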

plainText = sourceCode.text
# Open in binary mode so writing the encoded bytes works on Python 2 and 3
with open('something.html', 'wb') as f:
    f.write(plainText.encode('utf8'))

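When I ran your code, I first had to wait, because several times it returned an error page saying I was making requests too often. That may be your first problem.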
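A page like that normally arrives with a non-200 HTTP status, so it is worth checking the status before handing the text to the parser. Here is a minimal sketch of that check. It uses the standard library's urllib.request instead of requests so that it runs without extra packages. The names build_request and fetch_html and the User-Agent string are made up for illustration, and I'm assuming (not asserting) that a descriptive User-Agent gets throttled less often than a generic one.

```python
import urllib.error
import urllib.request

# Hypothetical identifier; replace it with something that describes your script.
USER_AGENT = 'reddit-spider-example/0.1'


def build_request(url):
    """Return a Request carrying a descriptive User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url):
    """Return the page text, or None if the server answered with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as err:
        # For example 429 when the server thinks we are requesting too often.
        print('Request failed with HTTP status', err.code)
        return None
```

With requests, the equivalent is to pass headers={'User-Agent': ...} to requests.get and to look at the status_code attribute of the response before using its text.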

When I did get the page, there were lots of links, but none with your class. Without watching the entire YouTube series, I'm not sure what 'title may-blank loggedin' is supposed to represent.
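One way to see which classes the links really carry is to tally the raw class attribute of every a tag in the saved page. The sketch below does that with the standard library's html.parser rather than BeautifulSoup. The AnchorClassCounter name and the sample markup are invented for illustration; a real run would feed it the contents of the saved HTML file.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorClassCounter(HTMLParser):
    """Count the raw class attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            class_value = dict(attrs).get('class')
            if class_value is not None:
                self.counts[class_value] += 1


# Made-up stand-in for the saved page, not real Reddit markup.
sample = (
    '<a class="title may-blank " href="/r/pics/1">one</a>'
    '<a class="title may-blank " href="/r/news/2">two</a>'
    '<a class="comments" href="/r/pics/1/c">3 comments</a>'
)

counter = AnchorClassCounter()
counter.feed(sample)
for class_value, count in counter.counts.most_common():
    # repr() makes stray whitespace in the attribute visible.
    print(repr(class_value), count)
```

Printing with repr makes details such as a trailing space in the attribute value easy to spot.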

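Now I see the problem.

loggedin is a class that only appears when you are logged in, and your crawler is not logged in.

You don't need to be logged in to see /r/all, so just use this: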

soup.findAll('a', {'class': 'title may-blank '})
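Matching the exact class string is brittle, because an extra class such as loggedin or a change in whitespace breaks it. A more forgiving approach is to require the individual class tokens instead. The following is a sketch of that idea using only the standard library's html.parser, not BeautifulSoup. TitleLinkParser, title_links, and the sample markup are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry all required class tokens."""

    def __init__(self, required=('title', 'may-blank')):
        super().__init__()
        self.required = set(required)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        # Split on whitespace so order and extra classes do not matter.
        tokens = set((attributes.get('class') or '').split())
        if self.required <= tokens and attributes.get('href'):
            self.hrefs.append(attributes['href'])


def title_links(html):
    """Return the hrefs of all title links found in the given HTML text."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.hrefs


# Made-up markup covering the logged-out and logged-in variants.
sample = (
    '<a class="title may-blank " href="/r/pics/1">logged out</a>'
    '<a class="title may-blank loggedin" href="/r/news/2">logged in</a>'
    '<a class="comments may-blank" href="/r/pics/1/c">comments</a>'
)
print(title_links(sample))
```

In BeautifulSoup itself, a CSS selector such as soup.select('a.title.may-blank') expresses the same token-based match.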

Answer 3

You aren't "logged in", so the loggedin class is never applied. This works without logging in:

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all'
    source = requests.get(URL)
    Soup = BeautifulSoup(source.text)
    for link in Soup.findAll('a',attrs={'class' : 'title may-blank '}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()
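One caveat about the string concatenation above: if an href is already a full URL, or starts with a slash, prefixing 'http://www.reddit.com/r/all/' produces an address that does not point where the link does. Whether that matters depends on what the hrefs on the page look like, which I'm only assuming here. A small sketch using urllib.parse.urljoin from the standard library (absolute_url and the sample hrefs are made up for illustration):

```python
from urllib.parse import urljoin

BASE = 'http://www.reddit.com/r/all/'


def absolute_url(href):
    """Resolve href against BASE; hrefs that are already absolute pass through."""
    return urljoin(BASE, href)


# Hypothetical hrefs: one site-relative path and one external link.
print(absolute_url('/r/pics/comments/abc123/'))
print(absolute_url('http://example.com/article'))
```

In the spider, this would replace the 'http://www.reddit.com/r/all/' + link.get('href') expression.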

Python - Reddit web crawler using BeautifulSoup4 returns nothing

for-loop

beautifulsoup

reddit

web-crawler