How to scrape football results from livescores?
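I'm working on a project in Python 3.4 and want to scrape livescore.com for football scores (results), for example all of the day's scores (England 2-2 Norway, France 2-1 Italy, etc.). I'm building it with Python 3.4 on 64-bit Windows 10.
I tried two approaches. Here is the code: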
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.livescore.com/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='container'):
    print(div.text)
When I run this code, a box pops up saying:
IDLE's subprocess didn't make connection. Either IDLE can't start a subprocess or firewall software is blocking the connection.
I decided to write another one. Here is the code:
# Import modules
import urllib.request
import re

# Download the live score XML from the website and read it
xml_data = urllib.request.urlopen('http://static.cricinfo.com/rss/livescores.xml').read()
# Pattern for searching score and link
pattern = "<item>(.*?)</item>"
# Find matches
for i in re.findall(pattern, xml_data, re.DOTALL):
    result = re.split('<.+?>', i)
    print(result[1], result[3])  # Print score
I got this error:
Traceback (most recent call last):
  File "C:\Users\Bright\Desktop\live_score.py", line 12, in <module>
    for i in re.findall(pattern, xml_data, re.DOTALL):
  File "C:\Python34\lib\re.py", line 206, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
In your first example, the site loads most of its content with JavaScript, so I suggest using selenium to get the page source.
Your code should look like this:
import bs4 as bs
from selenium import webdriver

url = 'http://www.livescore.com/'
browser = webdriver.Chrome()
browser.get(url)
sauce = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find('div', attrs={'data-type': 'container'}).find_all('div'):
    print(div.text)
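For the second example, the regex engine raises an error because read() on your request returns bytes, while your pattern is a str; "re" requires the pattern and the searched data to be the same type. You simply did not typecast xml_data to str.
Here is the modified code: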
for i in re.findall(pattern, str(xml_data), re.DOTALL):
    result = re.split('<.+?>', i)
    print(result[1], result[3])  # Print score
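One caveat with str(xml_data): on Python 3 it returns the repr of the bytes object (text starting with b' and containing escaped newlines), so decoding is probably the cleaner conversion. The following is a minimal sketch of that variant, not the original answer's code. The helper name extract_items, the SAMPLE_XML payload, and the assumption that the feed is UTF-8 are all hypothetical and used for illustration only; the sample stands in for what urlopen(...).read() would return.

```python
import re

# Hypothetical payload standing in for urllib.request.urlopen(...).read(),
# which returns bytes. The real feed layout may differ.
SAMPLE_XML = (
    b"<rss><channel>\n"
    b"<item>\n<title>Team A 2-2 Team B</title>\n"
    b"<link>http://example.com/1</link>\n</item>\n"
    b"<item>\n<title>Team C 2-1 Team D</title>\n"
    b"<link>http://example.com/2</link>\n</item>\n"
    b"</channel></rss>"
)

ITEM_PATTERN = r"<item>(.*?)</item>"


def extract_items(xml_bytes, encoding="utf-8"):
    """Decode the bytes to str, then pull (title, link) pairs out of each item.

    The encoding default is an assumption; check the feed's XML declaration
    or the response headers for the real one.
    """
    text = xml_bytes.decode(encoding)
    pairs = []
    for item in re.findall(ITEM_PATTERN, text, re.DOTALL):
        # Splitting on tags leaves the text between them:
        # index 1 is the title, index 3 is the link (same indices as above).
        parts = re.split(r"<.+?>", item)
        pairs.append((parts[1], parts[3]))
    return pairs


if __name__ == "__main__":
    for title, link in extract_items(SAMPLE_XML):
        print(title, link)
```

With a real response, the call would be extract_items(xml_data). Another option is to leave the data as bytes and use a bytes pattern instead (for example rb"<item>(.*?)</item>"), since "re" only requires that both sides have the same type.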