如何从嵌套的 <dl><dt> 中获取文本?
How can I get text from within a nested <dl><dt>?
我是网络抓取的新手,所以如果我有任何误解,我提前道歉...
我正在尝试从 ESPN 获取数据。这是我的 python 代码:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nba/teams'
r = requests.get(url)
soup = BeautifulSoup(r.text)
tables = soup.find_all('dl')
teams = []
prefix_1 = []
prefix_2 = []
teams_urls = []
for table in tables:
lis = table.find_all('dt', text=False)
print lis
for li in lis:
info = dt
teams.append(info.text)
url = info['href']
teams_urls.append(url)
prefix_1.append(url.split('/')[-2])
prefix_2.append(url.split('/')[-1])
print (teams)
当我在不同点打印时,我得到空括号 [] 作为 return。请帮忙。谢谢
您正在从菜单中提取团队名称,但实际页面内容也包含团队。
让我们使用 CSS selectors 访问页面上的每个团队 link。因此,让我们构建一个包含团队名称和 url 的字典列表:
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nba/teams'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'})
soup = BeautifulSoup(r.content, 'lxml')
teams = []
for link in soup.select('div.mod-table div.mod-content ul li h5 a[href]'):
teams.append({
'name': link.text,
'url': link['href']
})
print(teams)
打印:
[
{'name': u'Boston Celtics', 'url': 'http://espn.go.com/nba/team/_/name/bos/boston-celtics'},
{'name': u'Brooklyn Nets', 'url': 'http://espn.go.com/nba/team/_/name/bkn/brooklyn-nets'},
...
{'name': u'Utah Jazz', 'url': 'http://espn.go.com/nba/team/_/name/utah/utah-jazz'}
]
我是网络抓取的新手,所以如果我有任何误解,我提前道歉...
我正在尝试从 ESPN 获取数据。这是我的 python 代码:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nba/teams'
r = requests.get(url)
soup = BeautifulSoup(r.text)
tables = soup.find_all('dl')
teams = []
prefix_1 = []
prefix_2 = []
teams_urls = []
for table in tables:
lis = table.find_all('dt', text=False)
print lis
for li in lis:
info = dt
teams.append(info.text)
url = info['href']
teams_urls.append(url)
prefix_1.append(url.split('/')[-2])
prefix_2.append(url.split('/')[-1])
print (teams)
当我在不同点打印时,我得到空括号 [] 作为 return。请帮忙。谢谢
您正在从菜单中提取团队名称,但实际页面内容也包含团队。
让我们使用 CSS selectors 访问页面上的每个团队 link。因此,让我们构建一个包含团队名称和 url 的字典列表:
import requests
from bs4 import BeautifulSoup
url = 'http://espn.go.com/nba/teams'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'})
soup = BeautifulSoup(r.content, 'lxml')
teams = []
for link in soup.select('div.mod-table div.mod-content ul li h5 a[href]'):
teams.append({
'name': link.text,
'url': link['href']
})
print(teams)
打印:
[
{'name': u'Boston Celtics', 'url': 'http://espn.go.com/nba/team/_/name/bos/boston-celtics'},
{'name': u'Brooklyn Nets', 'url': 'http://espn.go.com/nba/team/_/name/bkn/brooklyn-nets'},
...
{'name': u'Utah Jazz', 'url': 'http://espn.go.com/nba/team/_/name/utah/utah-jazz'}
]