Scraping a Wikipedia table using "tr" and "td" with BeautifulSoup and Python
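Total Python 3 beginner here. I can't seem to print out just the names of the colleges.
The class isn't anywhere near the college names, and I can't narrow find_all down to what I need. I'd also like to write the result to a new CSV file. Any ideas?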
import requests
from bs4 import BeautifulSoup
import csv

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")
colleges = soup.find_all("table", class_="wikitable sortable")
for college in colleges:
    first_level = college.find_all("tr")
    print(first_level)
With:
colleges = soup.find_all("table", class_="wikitable sortable")
you get every table that has that class (there are five of them), not every college inside one table. So you can do something like this:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")

# find() returns only the first matching table instead of a list of all five
college_table = soup.find("table", class_="wikitable sortable")
colleges = college_table.find_all("tr")

for college in colleges:
    # The first link in each row is the college name
    college_link = college.find('a')
    # Rows without a link (the header row) are skipped
    if college_link is not None:
        college_name = college_link.text
        print(college_name)
Edit: I added an if to discard the first row, which holds the table header.
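The difference between find_all() and find() is easier to see on a small page that needs no network access. The following is only a sketch: the HTML is a made-up miniature with two tables sharing one class, not the real Wikipedia markup, and the names `html`, `tables`, `first_table` and `names` are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables with the same class attribute.
html = """
<table class="wikitable sortable">
  <tr><th>Institution</th><th>Location</th></tr>
  <tr><td><a href="/wiki/Brown_University">Brown University</a></td><td>Providence</td></tr>
  <tr><td><a href="/wiki/Yale_University">Yale University</a></td><td>New Haven</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Sport</th></tr>
  <tr><td>Rowing</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching table, find() only the first one.
tables = soup.find_all("table", class_="wikitable sortable")
first_table = soup.find("table", class_="wikitable sortable")

names = []
for row in first_table.find_all("tr"):
    cell = row.find("td")
    if cell is None:
        # Header rows contain only <th> cells, so there is nothing to read.
        continue
    link = cell.find("a")
    if link is not None:
        names.append(link.get_text(strip=True))

print(len(tables))
print(names)
```

Here the header row is skipped by checking for a missing td rather than a missing link, which also keeps working if a header cell happens to contain a link.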
You can use soup.select() to take advantage of CSS selectors and be more precise:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/Ivy_League")
soup = BeautifulSoup(res.text, "html.parser")

l = soup.select(".mw-parser-output > table:nth-of-type(2) > tbody > tr > td:nth-of-type(1) a")
for each in l:
    print(each.text)
Output:
Brown University
Columbia University
Cornell University
Dartmouth College
Harvard University
University of Pennsylvania
Princeton University
Yale University
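To see what each part of that selector does, here is a minimal offline sketch. The HTML is a hypothetical stand-in for the page layout (a wrapper div, a first table to skip, then the table of interest with an explicit tbody), so the real page may differ, and a position-based selector like this one can stop matching if the page layout changes.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: the second table under the wrapper
# div is the one we want, and only its first column holds college names.
html = """
<div class="mw-parser-output">
  <table><tbody><tr><td>Infobox</td></tr></tbody></table>
  <table>
    <tbody>
      <tr><th>Institution</th><th>Location</th></tr>
      <tr><td><a href="#">Brown University</a></td><td><a href="#">Providence</a></td></tr>
      <tr><td><a href="#">Yale University</a></td><td><a href="#">New Haven</a></td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

selector = (
    ".mw-parser-output > table:nth-of-type(2)"  # second table that is a direct child
    " > tbody > tr"                             # each of its rows
    " > td:nth-of-type(1) a"                    # links inside the first cell only
)

names = [a.get_text(strip=True) for a in soup.select(selector)]
print(names)
```

The links in the second column are left out because td:nth-of-type(1) matches only the first td of each row.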
To put the single column into a CSV:
import pandas as pd
pd.DataFrame([e.text for e in l]).to_csv("your_csv.csv") # This will include index
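Since the question already imports the csv module, the same single column can also be written without pandas. This is a sketch under a few assumptions: the `names` list stands in for the scraped text, `write_names` is a helper made up for this example, and the "college" header is an arbitrary column name.

```python
import csv
import io

# Stand-in for the names collected by either scraping approach.
names = ["Brown University", "Columbia University", "Cornell University"]


def write_names(names, fileobj):
    """Write one college name per row, preceded by a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["college"])  # hypothetical column name
    writer.writerows([name] for name in names)


# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_names(names, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# For a real file, open it with newline="" as the csv docs recommend:
# with open("your_csv.csv", "w", newline="", encoding="utf-8") as f:
#     write_names(names, f)
```

If you stay with pandas, passing index=False to to_csv leaves out the index column.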