Logic flow - trying to iterate through website pages with BeautifulSoup and CSV Writer
I can't seem to figure out the right indentation / clause placement to make this loop through more than one page. This code currently prints out a CSV file just fine, but only for the first page.
#THIS WORKS BUT ONLY PRINTS THE FIRST PAGE
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

page_num = 1
total_pages = 20

with open("MegaMillions.tsv", "w") as f:
    fieldnames = ['date', 'numbers', 'moneyball']
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(fieldnames)
    while page_num < total_pages:
        page_num = str(page_num)
        soup = BeautifulSoup(urlopen('http://www.usamega.com/mega-millions-history.asp?p=' + page_num).read())
        for row in soup('table', {'bgcolor': 'white'})[0].findAll('tr'):
            tds = row('td')
            if tds[1].a is not None:
                date = tds[1].a.string.encode("utf-8")
            if tds[3].b is not None:
                uglynumber = tds[3].b.string.split()
                betternumber = [int(uglynumber[i]) for i in range(len(uglynumber)) if i % 2 == 0]
                moneyball = tds[3].strong.string.encode("utf-8")
                writer.writerow([date, betternumber, moneyball])
    page_num = int(page_num)
    page_num += 1
print 'We\'re done here.'
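One way to sidestep the indentation trouble around `page_num = str(page_num)` / `page_num = int(page_num)` is to keep `page_num` an integer the whole time and convert it only where the URL is assembled. A minimal sketch (it only builds the URL list, no network access; `base` is the same URL as in the question):

```python
# Keep the page counter an int; convert to str only when building each URL.
base = 'http://www.usamega.com/mega-millions-history.asp?p='
total_pages = 20
urls = [base + str(page_num) for page_num in range(1, total_pages + 1)]
```

Each URL can then be fetched in turn inside a single `with open(...)` block, with no type round-trip to get wrong.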
And this one, of course, only prints the last page:
#THIS WORKS BUT ONLY PRINTS THE LAST PAGE
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

page_num = 1
total_pages = 20

while page_num < total_pages:
    page_num = str(page_num)
    soup = BeautifulSoup(urlopen('http://www.usamega.com/mega-millions-history.asp?p=' + page_num).read())
    with open("MegaMillions.tsv", "w") as f:
        fieldnames = ['date', 'numbers', 'moneyball']
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(fieldnames)
        for row in soup('table', {'bgcolor': 'white'})[0].findAll('tr'):
            tds = row('td')
            if tds[1].a is not None:
                date = tds[1].a.string.encode("utf-8")
            if tds[3].b is not None:
                uglynumber = tds[3].b.string.split()
                betternumber = [int(uglynumber[i]) for i in range(len(uglynumber)) if i % 2 == 0]
                moneyball = tds[3].strong.string.encode("utf-8")
                writer.writerow([date, betternumber, moneyball])
    page_num = int(page_num)
    page_num += 1
print 'We\'re done here.'
The problem with your second code sample is that you overwrite your file on every pass through the loop. Instead of
open("MegaMillions.tsv","w")
use
open("MegaMillions.tsv","a")
"a" opens the file for appending, which is what you want to do here.
Thanks to the suggestions, here is a working variant:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

page_num = 1
total_pages = 73

with open("MegaMillions.tsv", "w") as f:
    fieldnames = ['date', 'numbers', 'moneyball']
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(fieldnames)
    while page_num <= total_pages:
        page_num = str(page_num)
        soup = BeautifulSoup(urlopen('http://www.usamega.com/mega-millions-history.asp?p=' + page_num).read())
        for row in soup('table', {'bgcolor': 'white'})[0].findAll('tr'):
            tds = row('td')
            if tds[1].a is not None:
                date = tds[1].a.string.encode("utf-8")
            if tds[3].b is not None:
                uglynumber = tds[3].b.string.split()
                betternumber = [int(uglynumber[i]) for i in range(len(uglynumber)) if i % 2 == 0]
                moneyball = tds[3].strong.string.encode("utf-8")
                writer.writerow([date, betternumber, moneyball])
        page_num = int(page_num)
        page_num += 1
print 'We\'re done here.'
I went with this instead of "a", because with "a" the header row would have been written once for every page.
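The open-once pattern can be shown without any scraping. In this sketch the "pages" are fake in-memory rows (the field names match the script above; the data is invented for the demo): the file is opened once, the header is written once, and every page's rows land in the same writer.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "draws.tsv")
fieldnames = ['date', 'numbers', 'moneyball']

# Fake stand-ins for the rows scraped from each results page.
pages = [
    [('01/01/2014', [1, 2, 3], '7')],
    [('01/04/2014', [4, 5, 6], '9')],
]

with open(path, "w") as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(fieldnames)          # header written exactly once
    for page in pages:                   # loop over pages inside the open file
        for date, numbers, moneyball in page:
            writer.writerow([date, numbers, moneyball])

with open(path) as f:
    lines = f.read().splitlines()
```

The output has one header line followed by one line per draw, no matter how many pages were looped over.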