Reading information from reddit with urllib

Question

I have the following code:

import urllib
import re

def worldnews():
    count = 0
    html = urllib.urlopen("https://www.reddit.com/r/worldnews/").readlines()

    lines = html
    for line in lines:
        if "Paris" or "Putin" in line:
            count = count + 1
            print line       

    print "Totaal gevonden: ", count
    print "----------------------"

worldnews()

How can I find all reddit posts on that page whose title contains Paris or Putin? Is there a way to print just the post titles to the console? When I run it now, I get a lot of HTML code.

Answer 1

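The best way to work with HTML in Python is BeautifulSoup. You will need to download it and read the documentation to learn how to do exactly what you are asking. However, this should get you started: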

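import urllib
from bs4 import BeautifulSoup

def worldnews():
    # urllib.urlopen is the Python 2 call, as used in the question
    html = urllib.urlopen("https://www.reddit.com/r/worldnews/")
    soup = BeautifulSoup(html, "lxml")
    # Select every <p class="title"> element; these hold the post titles
    titles = soup.find_all('p', {'class': 'title'})
    for title in titles:
        print(title.text)

worldnews()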

When this is run, it gives output that looks like this:

Paris attacks ringleader dead - French officials (bbc.com)
Company which raised price of AIDS drug by 5500% reports m quarterly losses. (pinknews.co.uk)
Syria/IraqSyrian man kills judge at ISIS Sharia Court for beheading his brother (en.abna24.com)
Putin Puts  Million Bounty on Heads of Metrojet Bombers (fortune.com)

And so on for all the titles on the page. From here you should be able to work out the rest of the code fairly easily. :-)
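
As a possible next step, here is a minimal sketch of the keyword filtering the question asks for. It also avoids a pitfall in the question's code: `"Paris" or "Putin" in line` is always true, because the non-empty string "Paris" is truthy on its own, so every line was counted. The helper name `matching_titles` and the English "Total found" message are my own choices for illustration, and the sample list just reuses titles from the output above rather than fetching the page.

```python
def matching_titles(titles, keywords=("Paris", "Putin")):
    """Return the titles that contain at least one of the keywords."""
    # Each keyword gets its own `in` test; any() is true as soon as
    # one of them matches. The match is case-sensitive.
    return [t for t in titles if any(k in t for k in keywords)]


# Sample titles copied from the output above. Inside worldnews() this
# list would instead be built from the .text of each element in `titles`.
sample = [
    "Paris attacks ringleader dead - French officials (bbc.com)",
    "Syria/IraqSyrian man kills judge at ISIS Sharia Court for beheading his brother (en.abna24.com)",
    "Putin Puts  Million Bounty on Heads of Metrojet Bombers (fortune.com)",
]

found = matching_titles(sample)
for title in found:
    print(title)
print("Total found: %d" % len(found))
```

With the sample list, only the first and third titles are printed. Passing a different `keywords` tuple changes what is searched for.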
