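I need to retrieve information from a web page. Specifically, I need to get the player names that appear inside the '<span style="font-size:12px;">' tags on the page at the URL in the code below.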

Question

import urllib.request
from urllib.request import Request, urlopen

url = "http://www.hltv.org/match/2294502-clg-liquid-esea-invite-season-18-na"

#sock = urllib.request.urlopen(url)
sock = Request(url, "headers={'User-Agent': 'Mozilla/5.0'}")

#myhtml = sock.read()
myhtml = urlopen(sock).read()

for item in myhtml.split("</span>"):
    if '<span style="font-size:12px;">' in item:
            print (item [ item.find('<span style="font-size:12px;">' + len('<tag>')) : ])

This is the error I get when I run the code.

Traceback (most recent call last):
  File "Z:/hltv.py", line 10, in <module>
    myhtml = urlopen(sock).read()
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 453, in open
    req = meth(req)
  File "C:\Python34\lib\urllib\request.py", line 1104, in do_request_
    raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

I'm new to Python, so please keep the fix as simple as possible. Thanks. (I'm currently using Python 3.x.)

Answer 1

I rewrote it in Python 2.

Note the parentheses on the last line!

It should be item.find(TAG) + len(TAG), not item.find(TAG + len(TAG)) as in your code!

# -*- coding: utf-8 -*-
import urllib2

req = urllib2.Request("http://www.hltv.org/match/2294502-clg-liquid-esea-invite-season-18-na")
req.add_header('User-Agent', 'Mozilla/5.0')

response = urllib2.urlopen(req)
the_page = response.read()

TAG = '<span style="font-size:12px;">'
for item in the_page.split("</span>"):
    if TAG in item:
        print (item [ item.find(TAG) + len(TAG) : ])

Result

hazed
ptr
FNS
tarik
reltuC
nitr0
adreN
FugLy
NAF-FLY
daps

Note:

BeautifulSoup is better suited to complex queries on HTML content.
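
Since the question targets Python 3.x, the same fix might look roughly like the sketch below. The TypeError in the traceback comes from passing a string as the second positional argument of Request, which is the POST data; passing headers as a keyword argument avoids that. The names build_request, extract_names and fetch_names, as well as the sample string, are made up for illustration, and decoding as UTF-8 is an assumption about the page's encoding.

```python
from urllib.request import Request, urlopen

TAG = '<span style="font-size:12px;">'


def build_request(url):
    # Pass headers as a keyword argument (a dict). The second positional
    # argument of Request is `data`, which would turn this into a POST.
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


def extract_names(html):
    """Return the text that follows TAG in each '</span>'-delimited chunk."""
    names = []
    for item in html.split("</span>"):
        if TAG in item:
            # find(TAG) gives the start of the tag; adding len(TAG) skips past it.
            names.append(item[item.find(TAG) + len(TAG):])
    return names


def fetch_names(url):
    # In Python 3, read() returns bytes, so decode before using str methods.
    # UTF-8 is an assumption here, not something the page is known to use.
    html = urlopen(build_request(url)).read().decode("utf-8", errors="replace")
    return extract_names(html)


# Offline check on a made-up snippet (not the real page):
sample = ('<div><span style="font-size:12px;">hazed</span> '
          '<span style="font-size:12px;">ptr</span></div>')
print(extract_names(sample))
```

To run it against the live page, call fetch_names(url) with the URL from the question. That call is left out of the sketch so the example runs without network access.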

Answer 2

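This is not the right way to parse HTML. There are well-established libraries that do exactly what you need, such as BeautifulSoup or lxml. BeautifulSoup provides various APIs for selecting tags and more.

For example: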

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request("http://www.hltv.org/match/2294502-clg-liquid-esea-invite-season-18-na")
req.add_header('User-Agent', 'Mozilla/5.0')

response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page, "html.parser")


# Select all the span tags
span_tags = soup.find_all("span")

# Select only the spans that hold the player names
player_names = soup.find_all("span", attrs={"style": "font-size:12px;"})
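
The find_all call returns tag objects rather than plain strings; with BeautifulSoup, calling get_text() on each one yields the name. If installing a third-party package is not an option, a similar result can be sketched with the standard library's html.parser instead. This is a swapped-in alternative, not the approach from the answer above; the class name SpanTextParser and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside <span> tags whose style attribute matches."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.depth = 0    # greater than 0 while inside a matching <span>
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A span nested inside a matching span: track it so the
            # closing tags pair up correctly.
            self.depth += 1
        elif dict(attrs).get("style") == self.style:
            self.depth = 1
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.names[-1] += data


# Made-up snippet standing in for the downloaded page:
sample_html = ('<div><span style="font-size:12px;">hazed</span>'
               '<span class="x">skip</span>'
               '<span style="font-size:12px;">ptr</span></div>')

parser = SpanTextParser("font-size:12px;")
parser.feed(sample_html)
parser.close()
print(parser.names)
```

The parser matches on the exact style string, so it is just as fragile as the original approach if the site changes its markup; it only avoids hand-written string slicing.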


python

web-scraping

python-2.7
