How can I use Python to scrape a multipage table and export to a CSV file?
I'm trying to scrape a table that spans multiple pages and export it to a CSV file. It seems like only one line of data gets exported, and it's a mess.
I've looked around online and tried many iterations, and at this point I'm very frustrated. As you can tell from the code, I'm new to coding!
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w") as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        #final = row.strip("\n")
        #final = row.replace("\n","")
        with open('result.csv', 'a') as f:
            f.write(row)
It seems that when I write to the csv it overwrites the previous data. It also pastes everything on one line, with the player's name joined to the school name. Thanks for any help you can provide.
I think the problem is in your inner for loop. Try rewriting it as
with open('result.csv', 'a') as f:
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        f.write(", ".join(row) + "\n")  # write() expects a string, not a list
and see if that works.
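One caveat with joining on commas: some scraped fields (a player's club, say, or a commitment like "Smith, Jr.") may themselves contain commas, which would break the column layout. The standard library's csv.writer quotes such fields automatically. A minimal sketch, using made-up rows in place of the lists produced by `[i.text for i in td]`:

```python
import csv

# made-up rows standing in for the scraped [i.text for i in td] lists
table_rows = [
    ["Jane Doe", "F", "CA", "Forward", "2020", "LA Galaxy", "4.5", "UCLA"],
    ["John Smith, Jr.", "M", "TX", "Keeper", "2020", "FC Dallas", "4.0", "Duke"],
]

# newline='' is recommended by the csv docs to avoid blank lines on Windows
with open('result.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for row in table_rows:
        writer.writerow(row)  # fields containing commas are quoted automatically
```

This way each scraped row lands on its own line with correct quoting, no manual join needed.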
More generally, this can probably be done much more simply with pandas. Try changing the for loop to:
for i in range(0, max_page_num):
    page_num = ...
    source = ....
    df = pd.read_html(source)[0]  # read_html returns a *list* of DataFrames; take the first table
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' appends each table to the csv file, instead of overwriting it
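To make the read_html-to-to_csv pipeline concrete without hitting the live site, here is a self-contained sketch where two small HTML strings stand in for the paginated pages (the real script would pass the built URLs instead). It also shows one refinement: writing the header only on the first page, so the output file starts fresh each run:

```python
import pandas as pd
from io import StringIO

# stand-in pages; in the real script these would be the paginated URLs
pages = [
    "<table><tr><th>Name</th><th>State</th></tr>"
    "<tr><td>Jane Doe</td><td>CA</td></tr></table>",
    "<table><tr><th>Name</th><th>State</th></tr>"
    "<tr><td>John Smith</td><td>TX</td></tr></table>",
]

for i, page in enumerate(pages):
    # read_html returns a list of DataFrames, one per <table> in the HTML
    df = pd.read_html(StringIO(page))[0]
    # write (and include the header) on the first page, append thereafter
    df.to_csv('results.csv', mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)
```

Opening with mode 'w' on the first iteration avoids the original problem of stale rows accumulating across runs, while still appending within a single run.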