How to iterate over many websites and parse text using web crawler

I am trying to parse text and run sentiment analysis over text from multiple websites. I have successfully managed to scrape a single website at a time and generate a sentiment score using the TextBlob library, but I am trying to replicate this over many websites. Any ideas on where to start?

The code is below:

import urllib.request

from bs4 import BeautifulSoup
from textblob import TextBlob


url = "http://www.reddit.com/r/television/comments/38dqxf/josh_duggar_confessed_to_his_father_jim_bob/"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

#print(text)

wiki = TextBlob(text)
r = wiki.sentiment.polarity

print(r)

Thanks in advance.

This is how you fetch data from a website via a URL in Python:

from urllib.request import urlopen

response = urlopen('http://reddit.com/')
html = response.read()

html now contains all of the HTML from that URL, as bytes, which BeautifulSoup can consume directly.
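If you prefer, the same fetch can be done with the requests library (which the question already imports); it makes timeouts and HTTP error handling more convenient. A minimal sketch, with a made-up User-Agent string:

import requests

# Some sites (Reddit included) throttle clients without a User-Agent,
# so it is worth setting one; the value here is a placeholder.
headers = {"User-Agent": "my-sentiment-crawler/0.1"}

response = requests.get("http://reddit.com/", headers=headers, timeout=10)
response.raise_for_status()   # raise an exception on 4xx/5xx responses
html = response.text          # decoded HTML as a str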

I am not entirely sure what you want to get from each page. If you comment below, I can edit this answer and help you further.

Edit:

If you want to iterate over a list of URLs, you can create a function and do it like this:

from urllib.request import urlopen

from bs4 import BeautifulSoup
from textblob import TextBlob

# you can add to this
urls = ["http://www.google.com", "http://www.reddit.com"]


def parse_websites(list_of_urls):
    for url in list_of_urls:
        html = urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")

        # kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()    # rip it out

        # get text
        text = soup.get_text()

        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)

        #print(text)

        wiki = TextBlob(text)
        r = wiki.sentiment.polarity

        print(r)


parse_websites(urls)
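Once that works, two refinements are worth considering: the loop above stops at the first URL that fails to load, and the polarity scores are printed rather than returned. Below is a sketch along those lines; the score_websites name and the error handling are my additions, not part of the original answer:

from urllib.request import urlopen

from bs4 import BeautifulSoup
from textblob import TextBlob


def score_websites(list_of_urls):
    """Return {url: polarity}, skipping any URL that fails to load."""
    scores = {}
    for url in list_of_urls:
        try:
            html = urlopen(url, timeout=10).read()
        except OSError as err:   # URLError, timeouts and DNS failures are all OSError subclasses
            print("skipping %s: %s" % (url, err))
            continue

        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.extract()   # drop script/style so only visible text remains

        text = soup.get_text()
        scores[url] = TextBlob(text).sentiment.polarity
    return scores


print(score_websites(["http://www.google.com", "http://www.reddit.com"]))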