Non-greedy search for beginning of string
I have the following text, from which I need to extract links:
[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}],
"sources": [
{"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
"label":"Standard (288p)","res":"288"},
{"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"
I want to extract the links that end in mp4.
My regex is as follows:
"file":"(https\:.*?\.mp4)"
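However, my match is wrong, because the first link, the one ending in php, gets matched as well.
I am practicing on Pythex.org. How can I avoid the first link?
The HTML page I want to parse is https://www.rapidvideo.com/e/FFIMB47EWD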
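The behaviour described above can be reproduced with a small sketch. A lazy `.*?` only controls where the match ends; the match still starts at the first `"file":"https:` it finds, so it runs from the php link through to the first `.mp4`. One possible fix, assuming the URLs never contain a double quote, is to replace `.*?` with `[^"]*` so that the match cannot leave the quoted string. The variable names and the single-line sample text below are illustrative only.

```python
import re

# Sample text from the question, joined onto one line (assumed to resemble
# the raw page source, where '.' can run from one link into the next).
text = (
    r'[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}],'
    r'"sources": ['
    r'{"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",'
    r'"label":"Standard (288p)","res":"288"},'
    r'{"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"'
)

# Pattern from the question: the first match starts at the php link.
original = re.findall(r'"file":"(https\:.*?\.mp4)"', text)

# Alternative: [^"]* cannot cross a closing quote, so the php link is skipped.
fixed = re.findall(r'"file":"(https:[^"]*\.mp4)"', text)

# The captured URLs still carry JSON-escaped slashes; strip the backslashes.
clean = [url.replace('\\/', '/') for url in fixed]

for url in clean:
    print(url)
```

Here `original[0]` begins with the thumbnail URL, while `fixed` holds only the two playercdn links.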
Why use a regex at all? This looks like a JSON object / Python dictionary, so you can iterate over it and use str.endswith.
>>> sources = {
...     "sources": [
...         {"file": "https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
...          "label": "Standard (288p)","res":"288"},
...         {"file": "https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4",
...          "label": "Standard (288p)","res":"288"}
...     ]
... }
>>> for item in sources['sources']:
...     if item['file'].endswith('.mp4'):
...         print(item['file'])
...
https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4
https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4
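The URLs printed above still contain `\/` because the dictionary was typed in by hand as Python literals. If the data is available as JSON text instead, `json.loads` decodes those escapes. A minimal sketch, assuming a complete JSON object can be isolated from the page (the snippet in the question is only a fragment, so the `raw` string here is a shortened stand-in):

```python
import json

# Assumed input: a well-formed JSON document holding the "sources" list.
raw = r'''
{
  "sources": [
    {"file": "https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
     "label": "Standard (288p)", "res": "288"},
    {"file": "https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4",
     "label": "Standard (288p)", "res": "288"}
  ]
}
'''

data = json.loads(raw)  # "\/" is a valid JSON escape and decodes to "/"

# Keep only the entries whose URL ends in .mp4.
mp4_links = [item["file"] for item in data["sources"]
             if item["file"].endswith(".mp4")]

for link in mp4_links:
    print(link)
```

The filtering step is the same `str.endswith` check as in the session above; only the decoding of the input differs.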
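Edit:
After the JavaScript has loaded, the link appears to be available in the video tag. You could use a headless browser, but I just used selenium to load the page fully and then saved the HTML.
Once you have the full page HTML, you can parse it with BeautifulSoup instead of regular expressions.
See: Using regular expressions to parse HTML: why not?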
from bs4 import BeautifulSoup
from selenium import webdriver


def extract_mp4_link(page_html):
    # Parse the rendered HTML and return the src of the first <video> tag.
    soup = BeautifulSoup(page_html, 'lxml')
    return soup.find('video')['src']


def get_page_html(url):
    # Load the page in a real browser so the JavaScript runs before the HTML is read.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # quit() closes all windows and ends the driver process, even on error.
        driver.quit()


if __name__ == '__main__':
    page_url = 'https://www.rapidvideo.com/e/FFIMB47EWD'
    page_html = get_page_html(page_url)
    print(extract_mp4_link(page_html))