How do I avoid data from different tabs being concatenated into one cell when I scrape a table?
I scraped this page, https://www.capfriendly.com/teams/bruins, specifically looking for the tables under the Cap Hit tab (Forwards, Defense, Goalies).
I used Python with BeautifulSoup4, and CSV as the output format.
import requests, bs4, csv

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not per row
    for tr in table('tr', class_=['odd', 'even']):  # get all tr whose class is odd or even
        row = [td.text for td in tr('td')]  # extract each td's full text
        writer.writerow(row)
This is the output I get:
['Krejci, David "A"', 'NMC', 'C', 'NHL', '30', ',250,000,250,000NMC', ',250,000,500,000NMC', ',250,000,500,000NMC', ',250,000,000,000Modified NTC', ',250,000,000,000Modified NTC', 'UFA', '']
['Bergeron, Patrice "A"', 'NMC', 'C', 'NHL', '31', ',875,000,750,000NMC', ',875,000,750,000NMC', ',875,000,875,000,000,000NMC', ',875,000,375,000,500,000NMC', ',875,000,375,000,000,000Modified NTC, NMC', ',875,000,375,000,000,000Modified NTC, NMC', 'UFA']
['Backes, David', 'NMC', 'C, RW', 'NHL', '32', ',000,000,000,000,000,000NMC', ',000,000,000,000,000,000NMC', ',000,000,000,000,000,000NMC', ',000,000,000,000,000,000Modified NTC', ',000,000,000,000,000,000Modified NTC', 'UFA', '']
['Marchand, Brad', 'M-NTC', 'LW', 'NHL', '28', ',500,000,000,000Modified NTC', ',125,000,000,000,000,000NMC', ',125,000,000,000,000,000NMC', ',125,000,500,000,000,000NMC', ',125,000,000,000,000,000NMC', ',125,000,500,000,000,000NMC', ',125,000,000,000,000,000Modified NTC']
As you can see, data from different tabs is concatenated together:
',250,000,000,000Modified NTC'
Someone suggested that I use JavaScript to scrape the table. Would that solve my problem?
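According to the page source, some of the text in a given row is conditionally visible depending on which tab you are on (as your title says). The class .hide is added to a child element of the td when that element is meant to be hidden on that particular tab.
When you parse the td elements to retrieve their text, you can filter out the elements that are supposed to be hidden. That way you retrieve only the visible text, just as you would see it when viewing the page in a web browser.
In the snippet below, I added a parse_td function that filters out the child span elements whose class is hide and returns the text of what remains.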
import requests, bs4, csv

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

def parse_td(td):
    # Keep only the direct-child spans that are not marked as hidden
    filtered_data = [tag.text for tag in td.find_all('span', recursive=False)
                     if 'hide' not in tag.get('class', [])]
    # Fall back to the full cell text when the cell has no visible span
    return filtered_data[0] if filtered_data else td.text

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not per row
    for tr in table('tr', class_=['odd', 'even']):
        row = [parse_td(td) for td in tr('td')]
        writer.writerow(row)
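To see the effect without hitting the site, here is a minimal offline sketch of the same filtering idea. The markup in SAMPLE_HTML is made up for illustration (I'm assuming each cell holds one span per tab and that the inactive ones carry the hide class, as described above), and visible_text is just a hypothetical name for the same logic as parse_td.

```python
import bs4

# Hypothetical markup: one cell holds a value per tab, and the values for
# the inactive tabs carry the "hide" class (an assumption about the page).
SAMPLE_HTML = """
<table id="team">
  <tr class="odd">
    <td>Player name</td>
    <td><span class="cap">visible</span><span class="hide">hidden-1</span><span class="hide">hidden-2</span></td>
  </tr>
</table>
"""

def visible_text(td):
    """Return the text of the first direct-child span not marked 'hide'."""
    visible = [span.get_text() for span in td.find_all("span", recursive=False)
               if "hide" not in span.get("class", [])]
    # Cells without any span (e.g. the name column) fall back to their text
    return visible[0] if visible else td.get_text()

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
tr = soup.find(id="team").find("tr", class_="odd")

raw = [td.get_text() for td in tr("td")]          # everything, hidden or not
cleaned = [visible_text(td) for td in tr("td")]   # visible spans only

print(raw)      # ['Player name', 'visiblehidden-1hidden-2']
print(cleaned)  # ['Player name', 'visible']
```

The raw list reproduces the concatenation from the question, while the cleaned list keeps only what the active tab would show. If the real page nests the spans deeper or uses a different class name, the selector would need adjusting.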