Web-Scraping with Python on Canopy
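I'm having trouble with this piece of code, which is supposed to print the stock prices of the 4 listed companies. It runs without errors, but it only prints empty brackets where the stock prices should go. That is the source of my confusion.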
import urllib2
import re
symbolslist = ["aapl","spy","goog","nflx"]
i = 0
while i<len(symbolslist):
    url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span stream='+symbolslist[i]+' streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the price of", symbolslist[i], " is ", price
    i+=1
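Because you are not passing the variable. The single quotes and plus signs sit inside the double-quoted string, so they are literal characters and symbolslist[i] is never evaluated:
url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"
                                                  ^^^^^^^^^^^^^^^^^^^^^
                                                  a string, not the list element
Use str.format: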
url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbolslist[i])
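To make the difference concrete, here is a minimal sketch that builds both strings and prints them, without making any network request. The names literal_url and formatted_url are only for illustration and do not appear in the question.

```python
symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0

# Line from the question: the single quotes and plus signs are inside the
# double quotes, so they are ordinary characters and symbolslist[i] is
# never evaluated.
literal_url = "http://money.cnn.com/quote/quote.html?symb=' +symbolslist[i] + '"

# Corrected line: str.format substitutes the list element for {}.
formatted_url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbolslist[i])

print(literal_url)
print(formatted_url)
```

The first line printed still contains the text symbolslist[i], while the second ends in symb=aapl.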
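You can also iterate over the list directly, so there is no need for a while loop. Never parse HTML with a regex; use an HTML parser such as bs4. Your regex is also wrong: there is no stream="aapl" etc. in the page. What you want is the span with streamformat="ToHundredth" and streamfeed="SunGard":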
import urllib2
from bs4 import BeautifulSoup
symbolslist = ["aapl","spy","goog","nflx"]
for symbol in symbolslist:
    url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbol)
    htmlfile = urllib2.urlopen(url)
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(htmlfile.read(), "html.parser")
    # the price is the text of the span carrying both attributes
    price = soup.find("span",streamformat="ToHundredth", streamfeed="SunGard").text
    print "the price of {} is {}".format(symbol,price)
You can see this if we run the code:
In [1]: import urllib2
In [2]: from bs4 import BeautifulSoup
In [3]: symbols_list = ["aapl", "spy", "goog", "nflx"]
In [4]: for symbol in symbols_list:
   ...:     url = "http://money.cnn.com/quote/quote.html?symb={}".format(symbol)
   ...:     htmlfile = urllib2.urlopen(url)
   ...:     soup = BeautifulSoup(htmlfile.read(), "html.parser")
   ...:     price = soup.find("span",streamformat="ToHundredth", streamfeed="SunGard").text
   ...:     print "the price of {} is {}".format(symbol,price)
   ...:
the price of aapl is 115.57
the price of spy is 215.28
the price of goog is 771.76
the price of nflx is 97.34
And we get what you want.
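As for the empty brackets in the question: re.findall returns an empty list when nothing matches, and printing that list shows []. The sketch below reproduces this offline. The sample_html fragment is hypothetical; it assumes only the two attributes named above, and the real page markup may differ. The relaxed pattern is there only to show what the extra stream attribute does to the match; it is not a recommendation to keep the regex.

```python
import re

# Hypothetical fragment: only the streamformat and streamfeed attributes
# mentioned in the answer are assumed; the real page may look different.
sample_html = '<span streamformat="ToHundredth" streamfeed="SunGard">115.57</span>'

symbol = "aapl"

# Pattern from the question: it requires a stream=aapl attribute.
question_regex = ('<span stream=' + symbol +
                  ' streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>')
no_match = re.findall(question_regex, sample_html)

# Same pattern without the stream attribute, for comparison only.
relaxed_regex = '<span streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>'
match = re.findall(relaxed_regex, sample_html)

print(no_match)
print(match)
```

Under this assumption the first pattern finds nothing, so no_match is the empty list that shows up as empty brackets, while the second pattern captures the price text.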