How to scrape football results from livescores?

Question

I am working on a project using Python 3.4 on 64-bit Windows 10. I want to scrape livescore.com to get football scores (results), for example all of today's scores (England 2-2 Norway, France 2-1 Italy, etc.).

I have tried two approaches. Here is the code for the first:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.livescore.com/').read()
soup = bs.BeautifulSoup(sauce,'lxml')

for div in soup.find_all('div', class_='container'):
    print(div.text)

When I run this code, a box pops up saying:

IDLE's subprocess didn't make connection. Either IDLE can't start a subprocess or firewall software is blocking the connection.

I decided to write another one. Here is the code:

# Import Modules
import urllib.request
import re

# Downloading Live Score XML Code From Website and reading also
xml_data = urllib.request.urlopen('http://static.cricinfo.com/rss/livescores.xml').read()

# Pattern For Searching Score and link
pattern = "<item>(.*?)</item>"

# Finding Matches
for i in re.findall(pattern, xml_data, re.DOTALL):
    result = re.split('<.+?>',i)
    print (result[1], result[3]) # Print Score

I got this error:

Traceback (most recent call last):
  File "C:\Users\Bright\Desktop\live_score.py", line 12, in <module>
    for i in re.findall(pattern, xml_data, re.DOTALL):
  File "C:\Python34\lib\re.py", line 206, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

Answer 1

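In your first example, the site loads most of its content with JavaScript, so the HTML returned by urllib does not contain the scores. I suggest using selenium to get the rendered page source.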

Your code should look like this:

import bs4 as bs
from selenium import webdriver

url = 'http://www.livescore.com/'
browser = webdriver.Chrome()
browser.get(url)
sauce = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(sauce,'lxml')

for div in soup.find('div', attrs={'data-type': 'container'}).find_all('div'):
    print(div.text)
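
To see why the plain urllib fetch finds no scores, the difference between served and rendered HTML can be tried offline. This is only a minimal sketch: both HTML snippets are hypothetical markup invented for illustration (not copied from livescore.com), and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical markup: what the server might send before any script has run.
STATIC_HTML = """
<html><body>
<div data-type="container"></div>
<script>/* a browser would inject the scores here */</script>
</body></html>
"""

# Hypothetical markup: what a real browser might hold after running the script.
RENDERED_HTML = """
<html><body>
<div data-type="container"><div>England 2 - 2 Norway</div><div>France 2 - 1 Italy</div></div>
</body></html>
"""


class ContainerText(HTMLParser):
    """Collect the text found inside <div data-type="container">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # nesting depth of <div> tags inside the container
        self.chunks = []  # non-empty text pieces seen inside the container

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("data-type") == "container":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def container_text(html):
    """Return the list of text pieces inside the container div."""
    parser = ContainerText()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    print(container_text(STATIC_HTML))    # nothing to scrape yet
    print(container_text(RENDERED_HTML))  # scores are present after rendering
```

The static snippet yields an empty list, which mirrors what a urllib-based scraper sees, while the rendered snippet yields the score lines. That is the gap selenium closes by running the page's scripts in a real browser.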

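For the second example, the regex engine raises an error because read() on the response returns a bytes object, and a str pattern in "re" can only be applied to str. You simply did not convert xml_data to str. Decoding it does that (UTF-8 is assumed here; calling str(xml_data) also avoids the error, but it yields the b'...' repr with escaped newlines rather than the real text).

Here is the modified code:

for i in re.findall(pattern, xml_data.decode('utf-8'), re.DOTALL):
    result = re.split('<.+?>', i)
    print(result[1], result[3])  # Print score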
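
The bytes-versus-str point can also be checked offline. The sketch below is illustrative only: SAMPLE_FEED is a hypothetical stand-in for the bytes returned by read(), the helper names are made up for this example, and UTF-8 is an assumed encoding. It reproduces the TypeError, applies the same regex after decoding, and, as a swapped-in alternative to regex, parses the bytes directly with the standard library's xml.etree.ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for urlopen(...).read(), which returns bytes.
SAMPLE_FEED = (
    b"<rss><channel>"
    b"<item><title>England 2-2 Norway</title><link>http://example.com/1</link></item>"
    b"<item><title>France 2-1 Italy</title><link>http://example.com/2</link></item>"
    b"</channel></rss>"
)

ITEM_PATTERN = "<item>(.*?)</item>"  # str pattern, as in the question


def str_pattern_on_bytes_fails(xml_bytes):
    """Return True if applying the str pattern to bytes raises TypeError."""
    try:
        re.findall(ITEM_PATTERN, xml_bytes, re.DOTALL)
    except TypeError:
        return True
    return False


def items_with_regex(xml_bytes, encoding="utf-8"):
    """Decode the bytes first, then use the same regex approach."""
    text = xml_bytes.decode(encoding)
    rows = []
    for item in re.findall(ITEM_PATTERN, text, re.DOTALL):
        parts = re.split("<.+?>", item)
        rows.append((parts[1], parts[3]))  # (title, link)
    return rows


def items_with_etree(xml_bytes):
    """Alternative: let an XML parser read the bytes directly."""
    root = ET.fromstring(xml_bytes)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


if __name__ == "__main__":
    print(str_pattern_on_bytes_fails(SAMPLE_FEED))
    for title, link in items_with_etree(SAMPLE_FEED):
        print(title, link)
```

The ElementTree variant does not depend on the position of each tag inside an item, so it is likely to be less brittle than splitting on tags if the feed's layout changes.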

python

urllib

beautifulsoup

web-scraping

python-3.4