How to iterate over many websites and parse text using a web crawler
I am trying to parse text and run sentiment analysis on text from multiple websites. I have successfully been able to scrape one website at a time and generate a sentiment score using the TextBlob library, but I am trying to replicate this across many websites. Any ideas on where to start?
The code is below:
import requests
import json
import urllib
from bs4 import BeautifulSoup
from textblob import TextBlob

url = "http://www.reddit.com/r/television/comments/38dqxf/josh_duggar_confessed_to_his_father_jim_bob/"

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
#print(text)

wiki = TextBlob(text)

r = wiki.sentiment.polarity
print r
Thanks in advance.
This is how you get data from a website via its URL in Python:
import urllib2
response = urllib2.urlopen('http://reddit.com/')
html = response.read()
html is a string that contains all of the HTML from the URL.
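Since your question already imports requests, here is roughly the same fetch using that library instead of urllib2 (just a sketch; either way you end up with the HTML as a string):
import requests

# rough equivalent of the urllib2 call above; requests decodes the body for you
response = requests.get('http://reddit.com/')
html = response.text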
I am not entirely sure what you want to get from each page. If you leave a comment below, I can edit this answer and help you further.
Edit:
If you want to iterate over a list of URLs, you can create a function and do it like this:
import urllib
from bs4 import BeautifulSoup
from textblob import TextBlob

# you can add to this
urls = ["http://www.google.com", "http://www.reddit.com"]

def parse_websites(list_of_urls):
    for url in list_of_urls:
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)

        # kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()    # rip it out

        # get text
        text = soup.get_text()

        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        #print(text)

        wiki = TextBlob(text)

        r = wiki.sentiment.polarity
        print r

parse_websites(urls)
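If you would rather collect the scores instead of printing them, here is a sketch of the same idea that returns a dict mapping each URL to its polarity score. The function name sentiment_for_urls is just an example, and it uses requests (from your imports), so it also runs on Python 3:
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

# sketch: same approach as parse_websites, but returning {url: polarity}
def sentiment_for_urls(list_of_urls):
    scores = {}
    for url in list_of_urls:
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        # remove script and style elements before extracting the visible text
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        scores[url] = TextBlob(text).sentiment.polarity
    return scores

print(sentiment_for_urls(["http://www.google.com", "http://www.reddit.com"]))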