Save a troublesome webpage and import it back into Python
I'm trying to extract some information from various pages and am struggling a bit. This shows my challenge:
import requests
from lxml import html
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
response = requests.get(url)
print(response.content)
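If you copy the output into Notepad, the value "9.20" (the Team A odds, at the bottom right of the webpage) is nowhere to be found in it. However, if you open the webpage in a browser, use Save As, and then import the saved file back into Python like this, you can locate and extract the 9.20 value: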
with open(r'HUL 1-7 TOT _ Hull - Tottenham _ Match Summary.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)
output = tree.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()') #the xpath for the TeamA odds or the 9.20 value
output # ['9.20']
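I'm not sure why this workaround works; that is beyond my understanding. So what I'd like to do is save the webpage to my local drive, open it in Python as above, and carry on from there. But how do I replicate Save As in Python? This doesn't work: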
import urllib.request
response = urllib.request.urlopen(url)
webContent = response.read().decode('utf-8')
f = open('HUL 1-7 TOT _ Hull - Tottenham _ Match Summary.html', 'w')
f.write(webContent)
f.flush()
f.close()
It gives me a webpage, but it is only a fraction of the original page...?
As @Pedro Lobito said, the page content is generated by JavaScript. To get it, you need a module that can run JavaScript. I would choose requests_html or selenium.
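To make the difference concrete, here is a minimal sketch using only the standard library. Both markup strings are made-up stand-ins (not the real soccer24.com source): one mimics what a plain HTTP fetch returns before any script has run, the other mimics the DOM a browser holds after the script has filled in the table. The `IdTextCollector` class and `text_inside` helper are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class IdTextCollector(HTMLParser):
    """Collect the text found inside the element carrying a given id.

    Simplified on purpose: it assumes every tag inside the target
    element has a closing tag (no void elements such as <br>).
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means "currently outside the target element"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def text_inside(markup, target_id):
    """Return the non-empty text chunks inside the element with this id."""
    parser = IdTextCollector(target_id)
    parser.feed(markup)
    parser.close()
    return parser.chunks


# Hypothetical markup: the table exists but is empty until a script runs.
static_html = (
    '<html><body><table id="default-odds"></table>'
    "<script>/* fills the table later */</script></body></html>"
)
# Hypothetical markup: the same page after the script has run.
rendered_html = (
    '<html><body><table id="default-odds"><tbody><tr>'
    "<td><span>9.20</span></td></tr></tbody></table></body></html>"
)

print(text_inside(static_html, "default-odds"))    # []
print(text_inside(rendered_html, "default-odds"))  # ['9.20']
```

A browser's Save As writes out something like the second string, which is why the saved file contains the value while the output of `requests.get` does not.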
Requests_html
from requests_html import HTMLSession
url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
session = HTMLSession()
response = session.get(url)
response.html.render()
result = response.html.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()')
print(result)
#['9.20']
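Selenium
from selenium import webdriver
from lxml import html

url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary"
dr = webdriver.Chrome()
try:
    dr.get(url)
    # If the page source is read before the odds have loaded, add an explicit
    # wait here, before reading dr.page_source:
    # https://selenium-python.readthedocs.io/waits.html
    # from selenium.webdriver.common.by import By
    # from selenium.webdriver.support.ui import WebDriverWait
    # from selenium.webdriver.support import expected_conditions as EC
    # WebDriverWait(dr, 10).until(
    #     EC.presence_of_element_located((By.ID, "myDynamicElement"))
    # )
    tree = html.fromstring(dr.page_source)  # markup after JavaScript has run
    output = tree.xpath('//*[@id="default-odds"]/tbody/tr/td[2]/span/span[2]/span/text()') #the xpath for the TeamA odds or the 9.20 value
    print(output)
finally:
    # Always shut the browser down, even if an exception was raised above.
    dr.quit()
#['9.20']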
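If you still want the local copy described in the question, one option is to write the rendered markup to disk yourself and read it back later. The sketch below is only an illustration: `save_html` and `load_html` are hypothetical helper names, the file name is made up, and the markup string is a placeholder for whatever rendered source you obtained (for example `dr.page_source` from the Selenium snippet). An explicit UTF-8 encoding is used so that the result does not depend on the platform default.

```python
from pathlib import Path


def save_html(markup, path):
    """Write rendered HTML to disk as UTF-8 and return the path."""
    path = Path(path)
    path.write_text(markup, encoding="utf-8")
    return path


def load_html(path):
    """Read a previously saved HTML file back into a string."""
    return Path(path).read_text(encoding="utf-8")


if __name__ == "__main__":
    # Placeholder markup; in practice pass the rendered page source here.
    sample = "<html><body><span>9.20</span></body></html>"
    saved = save_html(sample, "match_summary.html")  # hypothetical file name
    print(load_html(saved) == sample)
```

The string returned by `load_html` can then be passed to `html.fromstring` exactly as in the question's saved-file example.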