Python is giving me both columns of a table I am scraping, but I only want one of the columns
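I am using Python to scrape the names of the Alaska Supreme Court justices from Ballotpedia (https://ballotpedia.org/Alaska_Supreme_Court). My current code gives me the judges' names as well as the names of the people in the "Appointed By" column. Here is my current code: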
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Alaska_Supreme_Court']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")]
df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('18-TEST.csv')
I have been trying to work with this line:
temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")]
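I am fairly inexperienced with the inspect feature on web pages, so I am not sure what to change in the "tablesorter" selector. I am a bit lost at this point and have not been able to find resources on this. Can you help me get Python to give me the Judge column but not the Appointed By column? Thanks!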
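First, note that this code was taken from another source.
If you do not know how many rows or columns there are, this gives you a DataFrame containing all of the columns in the table on the web page. Feel free to drop any column you do not need.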
import requests
from bs4 import BeautifulSoup
import pandas as pd
# I'll do it for the one page example
page = 'https://ballotpedia.org/Alaska_Supreme_Court'
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
# this finds the first table with the class specified
table = soup.find('table', attrs={'class':'wikitable sortable jquery-tablesorter'})
# get all rows of the above table
rows = table.find_all('tr')
data = []
for row in rows:
    # collect the stripped text of every non-empty cell in this row
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
# turn it into a pandas dataframe
df = pd.DataFrame(data)
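If only one column is wanted, one possible refinement is to find the column's position from the header row instead of collecting every cell. The following is a minimal sketch under stated assumptions: the inline HTML is a made-up stand-in for the real page (whose markup may differ), and the helper name `column_values` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the table on the page; the real markup may differ.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Judge</th><th>Appointed By</th></tr>
  <tr><td><a href="/a">Judge A</a></td><td><a href="/x">Governor X</a></td></tr>
  <tr><td><a href="/b">Judge B</a></td><td><a href="/y">Governor Y</a></td></tr>
</table>
"""

def column_values(html, header):
    """Return the text of every cell under the column whose header matches."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # Locate the wanted column by its header text.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    idx = headers.index(header)
    values = []
    for row in rows[1:]:
        cells = row.find_all("td")
        if len(cells) > idx:  # skip rows that are too short
            values.append(cells[idx].get_text(strip=True))
    return values

judges = column_values(SAMPLE_HTML, "Judge")
print(judges)
```

Looking the column up by name means the code keeps working if the column order changes, at the cost of assuming the first row holds the headers.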
There are different options to get the result.
Option #1
Slice the list and select every second element:
[item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")][0::2]
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
lst = ['https://ballotpedia.org/Alaska_Supreme_Court']
temp_dict = {}
for page in lst:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    # keep every second link text, starting with the first
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")][0::2]
pd.DataFrame.from_dict(temp_dict, orient='index').transpose().to_csv('18-TEST.csv', index=False)
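One thing to keep in mind with slicing is that it assumes every row contributes exactly two links. The sketch below illustrates what can happen otherwise; the inline HTML is a hypothetical example in which one "Appointed By" cell has no link, not the markup of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical rows; the second row has no link in its "Appointed By" cell.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><td><a href="/a">Judge A</a></td><td><a href="/x">Governor X</a></td></tr>
  <tr><td><a href="/b">Judge B</a></td><td>Governor Y</td></tr>
  <tr><td><a href="/c">Judge C</a></td><td><a href="/z">Governor Z</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# All link texts in document order, regardless of which column they sit in.
links = [a.text for a in soup.select("table.wikitable a")]
# Every second element: correct only while links strictly alternate by column.
sliced = links[0::2]
print(links)
print(sliced)
```

Here the missing link shifts the alternation, so the slice picks up "Governor Z" and skips "Judge C". If the table might contain such rows, selecting by cell position is likely the safer choice.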
Option #2
Make your selection more specific and select only the first td in each tr:
[item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter tr > td:nth-of-type(1)")]
Example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# named lst rather than list to avoid shadowing the built-in
lst = ['https://ballotpedia.org/Alaska_Supreme_Court']
temp_dict = {}
for page in lst:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    # only the first td of each row, i.e. the Judge column
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter tr > td:nth-of-type(1)")]
pd.DataFrame.from_dict(temp_dict, orient='index').transpose().to_csv('18-TEST.csv', index=False)
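To see what the `td:nth-of-type(1)` selector matches without hitting the network, it can be tried on a small inline table. This is only a sketch: the HTML below is a hypothetical stand-in, and the shorter class selector `table.wikitable` is an assumption made to keep the example small.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table, including a header row of th cells.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Judge</th><th>Appointed By</th></tr>
  <tr><td><a href="/a">Judge A</a></td><td><a href="/x">Governor X</a></td></tr>
  <tr><td><a href="/b">Judge B</a></td><td><a href="/y">Governor Y</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# First td of each row; the header row has only th cells, so it is not matched.
first_cells = [td.get_text(strip=True)
               for td in soup.select("table.wikitable tr > td:nth-of-type(1)")]
# Changing the index selects the other column instead.
second_cells = [td.get_text(strip=True)
                for td in soup.select("table.wikitable tr > td:nth-of-type(2)")]
print(first_cells)
print(second_cells)
```

Because the selector works on cell position rather than on links, it does not depend on every cell containing an `a` element.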
Option #3
Use the pandas function read_html():
Example:
import pandas as pd
df = pd.read_html('https://ballotpedia.org/Alaska_Supreme_Court')[2]
df.Judge.to_csv('18-TEST.csv', index=False)
I would like to share another approach that gets your table into the desired format:
import pandas as pd
# extracting table and making it dataframe
frame = pd.read_html('https://ballotpedia.org/Alaska_Supreme_Court',attrs={"class":"wikitable sortable jquery-tablesorter"})[0]
# drop unwanted columns
frame.drop("Appointed By", axis=1, inplace=True)
# save dataframe as csv
frame.to_csv("desired/path/output.csv", index=False)
Printing frame gives output like the following:
|Judge|
|-----|
|Daniel Winfree|
|Joel Harold Bolger|
|Peter Jon Maassen|
|Susan Carney|
|Dario Borghesan|
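As an alternative to dropping the unwanted column in place, the wanted column can be selected by name, which leaves the original frame untouched. The sketch below uses a small hand-built DataFrame with placeholder values as a stand-in for the table returned by read_html; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical frame standing in for the table returned by read_html.
frame = pd.DataFrame({
    "Judge": ["Judge A", "Judge B"],
    "Appointed By": ["Governor X", "Governor Y"],
})

# Double brackets return a one-column DataFrame rather than a Series.
judges_only = frame[["Judge"]]

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = judges_only.to_csv(index=False)
print(csv_text)
```

Selecting by name states which column is kept, so the code does not need to change if further columns appear in the table later.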