I need to retrieve information from a web page. Specifically, I need to get the player names from the page at the URL in the code below; each name sits inside a '<span style="font-size:12px;">' tag.
import urllib.request
from urllib.request import Request, urlopen
url = "http://www.hltv.org/match/2294502-clg-liquid-esea-invite-season-18-na"
#sock = urllib.request.urlopen(url)
sock = Request(url, "headers={'User-Agent': 'Mozilla/5.0'}")
#myhtml = sock.read()
myhtml = urlopen(sock).read()
for item in myhtml.split("</span>"):
    if '<span style="font-size:12px;">' in item:
        print (item [ item.find('<span style="font-size:12px;">' + len('<tag>')) : ])
This is the error I get when I run the code:
Traceback (most recent call last):
  File "Z:/hltv.py", line 10, in <module>
    myhtml = urlopen(sock).read()
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 453, in open
    req = meth(req)
  File "C:\Python34\lib\urllib\request.py", line 1104, in do_request_
    raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
I'm new to Python, so please keep the fix as simple as possible. Thanks. (I'm currently using Python 3.x.)
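I rewrote it in Python 2. Note the parentheses on the last line! It should be item.find(TAG) + len(TAG), not item.find(TAG + len(TAG)) as in your code. The second form tries to add a string and an integer before find is even called, which raises a TypeError.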
# -*- coding: utf-8 -*-
import urllib2

req = urllib2.Request("http://www.hltv.org/match/2294502-clg-liquid-esea-invite-season-18-na")
req.add_header('User-Agent', 'Mozilla/5.0')
response = urllib2.urlopen(req)
the_page = response.read()

# Opening tag that precedes each player name
TAG = '<span style="font-size:12px;">'
for item in the_page.split("</span>"):
    if TAG in item:
        # Slice from just past the opening tag to the end of the chunk
        print(item[item.find(TAG) + len(TAG):])
Result:
hazed
ptr
FNS
tarik
reltuC
nitr0
adreN
FugLy
NAF-FLY
daps
Note: BeautifulSoup is better suited to complex queries on HTML content.
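Since the question is on Python 3, here is a rough sketch of the same fix there. The TypeError in the traceback comes from passing the headers string as the second positional argument of Request, which is the POST data; headers should be given as a keyword argument instead. In Python 3, read() also returns bytes, so the page has to be decoded before splitting on a str. The function names fetch_html and extract_names, the utf-8 encoding, and the sample string are assumptions for illustration, not part of the original code.

```python
from urllib.request import Request, urlopen

# Opening tag that precedes each player name (taken from the question)
TAG = '<span style="font-size:12px;">'

def fetch_html(url):
    """Download a page and return it as text (hypothetical helper)."""
    # headers must be a keyword argument; the second positional
    # argument of Request is the POST data.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    # read() returns bytes in Python 3; utf-8 is an assumption here.
    return urlopen(req).read().decode('utf-8', errors='replace')

def extract_names(html):
    """Return the text that follows each TAG, up to the next </span>."""
    names = []
    for item in html.split('</span>'):
        if TAG in item:
            # find(TAG) + len(TAG) is the index just past the opening tag
            names.append(item[item.find(TAG) + len(TAG):])
    return names

# Made-up sample so the sketch runs without network access
sample = ('<div><span style="font-size:12px;">hazed</span> '
          '<span style="font-size:12px;">ptr</span></div>')
print(extract_names(sample))  # ['hazed', 'ptr']
```

For the real page, something like extract_names(fetch_html(url)) with the url from the question should give the same list as the Python 2 version, assuming the page still has that markup.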
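This is not the right way to parse HTML. There are well-established libraries built for exactly this requirement, such as BeautifulSoup or lxml. BeautifulSoup provides various APIs for selecting tags and more.
For example: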
import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request("http://www.hltv.org/match/2294502-clg-liquid-esea-invite-season-18-na")
req.add_header('User-Agent', 'Mozilla/5.0')
response = urllib2.urlopen(req)
the_page = response.read()
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(the_page, "html.parser")
# To select all the span tags
span_tags = soup.find_all("span")
# To get the player names
player_names = soup.find_all("span", attrs={"style": "font-size:12px;"})
# find_all returns Tag objects; get_text() gives the name inside each one
for tag in player_names:
    print(tag.get_text())
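If installing a third-party package is not an option, roughly the same idea can be sketched in Python 3 with the standard library's html.parser module. This is only a minimal illustration: the class name PlayerNameParser and the sample string are made up, and the sketch assumes the player spans are not nested inside each other.

```python
from html.parser import HTMLParser

class PlayerNameParser(HTMLParser):
    """Collects the text of every <span style="font-size:12px;"> element."""

    def __init__(self):
        super().__init__()
        self.in_player_span = False  # True while inside a matching span
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and dict(attrs).get('style') == 'font-size:12px;':
            self.in_player_span = True

    def handle_endtag(self, tag):
        # Assumes spans are not nested, so any </span> closes the current one
        if tag == 'span':
            self.in_player_span = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a matching span
        if self.in_player_span and data.strip():
            self.names.append(data.strip())

# Made-up sample so the sketch runs without network access
sample = ('<div><span style="font-size:12px;">hazed</span>'
          '<span>other</span>'
          '<span style="font-size:12px;">ptr</span></div>')
parser = PlayerNameParser()
parser.feed(sample)
parser.close()
print(parser.names)  # ['hazed', 'ptr']
```

Unlike splitting on "</span>", a parser of this kind matches on the parsed attribute value, so spans with other styles are skipped rather than mixed into the result.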