Python(3) - store one specific link of output from soup.find_all in variable

Question

I am working on a web scraper. I want to find the latest download link ending in ".img.html" at:

"https://dl.twrp.me/gauguin/"

and store this link in a variable instead of just printing it.

My code so far:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://dl.twrp.me/gauguin/"

html_doc = urlopen(url)
# defining html link (twrp...)

soup = BeautifulSoup(html_doc, "html.parser")

for link in soup.find_all('a'):
    links = (link.get('href'))
    print(links)

My output:

https://twrp.me
#
https://twrp.me/about/
https://twrp.me/contactus/
https://twrp.me/Devices/
https://twrp.me/FAQ/
/public.asc
/gauguin/twrp-3.5.2_10-0-gauguin.img.html
/gauguin/twrp-3.5.1_10-0-gauguin.img.html
/gauguin/twrp-3.5.0_10-0-gauguin.img.html
https://twrp.me/terms/termsofservice.html
https://twrp.me/terms/cookiepolicy.html
https://github.com/TeamWin

So my goal is to filter this output down to just this one link (the latest):

/gauguin/twrp-3.5.2_10-0-gauguin.img.html

stored in a variable, so that I can use the variable later or even download the file directly with wget.

Answer 1

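From your question, I understand that you are looking for a way to get the links ending in ".img.html" so that you can use them later. The code below extracts all matching links and stores them in a Python list for later use.

You can try this: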

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://dl.twrp.me/gauguin/"

html_doc = urlopen(url)
# defining html link (twrp...)
soup = BeautifulSoup(html_doc, "html.parser")

links = []

# href=True skips <a> tags without an href attribute (get() would return None for them)
for link in soup.find_all('a', href=True):
    links.append(link.get('href'))

# target_strings will contain all the links that end with .img.html

target_strings = []
for i in links:
    if i.endswith('.img.html'):
        target_strings.append(i)

# if needed later, a single element can be taken from the list, e.g. target_strings[0]
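
To make the last comment concrete, here is a minimal sketch of taking one element out of the list and turning it into an absolute URL. The list is hard-coded from the output shown in the question so that the sketch runs offline, and the names base_url, latest_link and latest_url are illustrative only. It assumes the page lists the newest file first, which holds for the output shown but is not guaranteed.

```python
from urllib.parse import urljoin

base_url = "https://dl.twrp.me/gauguin/"

# Values copied from the output shown in the question,
# hard-coded so the sketch runs without network access.
target_strings = [
    "/gauguin/twrp-3.5.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.1_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.0_10-0-gauguin.img.html",
]

# Assumption: the newest file is listed first, as in the question's output.
latest_link = target_strings[0]

# The hrefs are site-relative, so join them with the page URL.
latest_url = urljoin(base_url, latest_link)
print(latest_url)
```

The variable latest_url can then be reused later in the script or passed to an external downloader.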

Answer 2

While iterating over all the <a> elements, you can use the str.endswith() method to check whether the URL ends with .img.html. Do this to collect all matching URLs in a list:

urls = []
for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
    url = link.get('href')
    if url.endswith('.img.html'):
        urls.append(url)

which gives a list like this:

urls = ['/gauguin/twrp-3.5.2_10-0-gauguin.img.html',
'/gauguin/twrp-3.5.1_10-0-gauguin.img.html',
'/gauguin/twrp-3.5.0_10-0-gauguin.img.html']

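Next, it depends on whether the version specifiers in your URLs are guaranteed to follow lexicographic order, i.e. whether a simple string comparison yields the latest one. This is generally true if they all share the same format, with each part of the version number having the same number of digits across the different strings. The ones shown in your example satisfy this requirement.

If so, simply do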

max(urls)

which gives

'/gauguin/twrp-3.5.2_10-0-gauguin.img.html'

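If that is not the case (for example, if you had '/gauguin/twrp-3.15.2_10-0-gauguin.img.html', which is numerically greater than 3.5.2 but not lexicographically), you will have to parse the version number out of the file name, probably with a regular expression, and compare those version numbers. You can do this with the key argument of the max() function.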
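
The pitfall can be seen in a few lines. This is only an illustrative sketch: the 3.15.2 file name is the hypothetical one from the paragraph above, not a file that appears in the question's output.

```python
# The 3.15.2 entry is hypothetical, used only to show the pitfall.
urls = [
    "/gauguin/twrp-3.5.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.15.2_10-0-gauguin.img.html",
]

# Strings compare character by character: at the first difference,
# "5" sorts after "1", so 3.5.2 wins even though 3.15.2 is newer.
lexicographic_max = max(urls)
print(lexicographic_max)

# Tuples of integers compare numerically, part by part.
numeric_compare = (3, 15, 2) > (3, 5, 2)
print(numeric_compare)
```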

Suppose your version numbers have the format <numbers>.<numbers>.<numbers>_<numbers>-<numbers>. You would use the following regular expression:

\d+\.\d+\.\d+_\d+-\d+

Explanation
\d+    : One or more digits
\.     : The . character

To use it with max(), you can write a function that extracts the version number from the file name:

import re

def extract_version(filename):
    # e.g. filename = '/gauguin/twrp-3.5.2_10-0-gauguin.img.html'
    
    match = re.search(r"(\d+)\.(\d+)\.(\d+)_(\d+)-(\d+)", filename)
    # e.g. match = <re.Match object; span=(14, 24), match='3.5.2_10-0'>
    # e.g. match.groups() = ('3', '5', '2', '10', '0')

    if match is not None:
        return tuple(int(m) for m in match.groups()) # Convert match to tuple of integer for correct comparison

    return tuple() # if match is none, return an empty tuple

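The one change I made to the regular expression is to wrap each \d+ in parentheses. The parentheses make it a capturing group, so every numeric part is captured as a separate group. re.search() then returns a match object whose .groups() method gives a tuple containing the numbers, but as strings. So we need to convert these strings to integers before returning.

Then use that function as the key argument of max():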

max(urls, key=extract_version)
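
Putting the pieces together, the result can be stored in a variable, which is what the question asks for. The sketch below repeats the function so that it runs on its own. The list is sample data: two entries come from the question's output and the 3.15.2 entry is hypothetical, added to show that numeric comparison picks it.

```python
import re

def extract_version(filename):
    # Capture the five numeric parts, e.g. ('3', '5', '2', '10', '0')
    match = re.search(r"(\d+)\.(\d+)\.(\d+)_(\d+)-(\d+)", filename)
    if match is not None:
        # Integers compare numerically, so 15 ranks above 5
        return tuple(int(m) for m in match.groups())
    return tuple()  # no version found: sorts below every real version

# Sample data; the 3.15.2 entry is hypothetical.
urls = [
    "/gauguin/twrp-3.5.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.15.2_10-0-gauguin.img.html",
    "/gauguin/twrp-3.5.1_10-0-gauguin.img.html",
]

# Store the newest link in a variable for later use
latest = max(urls, key=extract_version)
print(latest)
```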

Answer 3

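I assume you want the link with the most recent date. For that, you need to capture both the URL and the date for each link. The date can then be converted to a datetime object and added to a list together with the URL.

Once all the URLs have been found, the list can easily be sorted by date, latest first. The latest URL can then be used to download the img file.

For example: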

from bs4 import BeautifulSoup
import requests
from datetime import datetime

base_url = "https://dl.twrp.me"
req = requests.get(f"{base_url}/gauguin")
soup = BeautifulSoup(req.content, "html.parser")

urls = []

for a in soup.find_all('a', href=True):
    link = a['href']
    
    if link.endswith('.img.html'):
        date_text = a.find_next('em').get_text(strip=True)
        date_dt = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S %Z")
        urls.append([date_dt, link])

latest = sorted(urls, reverse=True)[0][1]       # choose the latest url

# Download the latest img file
url_img = base_url + latest.split('.html')[0]
filename = url_img.split('/')[-1]

with requests.get(url_img, stream=True, headers={'Referer' : base_url + latest}) as req_img:
    with open(filename, 'wb') as f_img:
        for chunk in req_img.iter_content(chunk_size=2**15): 
            f_img.write(chunk)

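This approach has the advantage that it may still work if the naming or numbering scheme changes. The Referer header was added to stop the site from returning the download page HTML instead of the file.

This results in a download of 131,072 KB.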
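
The date-based selection can also be tried offline. The following is a rough sketch only: the timestamp strings are invented placeholders in the format the code above parses, not the dates actually shown on the page, and the name rows is hypothetical.

```python
from datetime import datetime

# Hypothetical (timestamp text, link) pairs. The timestamps are invented
# placeholders; only the links come from the question's output.
rows = [
    ("2021-01-01 00:00:00 UTC", "/gauguin/twrp-3.5.0_10-0-gauguin.img.html"),
    ("2021-03-01 00:00:00 UTC", "/gauguin/twrp-3.5.2_10-0-gauguin.img.html"),
    ("2021-02-01 00:00:00 UTC", "/gauguin/twrp-3.5.1_10-0-gauguin.img.html"),
]

# Parse each timestamp with the same format string as above
parsed = [
    (datetime.strptime(text, "%Y-%m-%d %H:%M:%S %Z"), link)
    for text, link in rows
]

# max() with a key compares the dates only and avoids sorting the whole list
latest_date, latest = max(parsed, key=lambda pair: pair[0])
print(latest)
```

Using max() instead of sorted(...)[0] is a design choice here: only the newest entry is needed, so a full sort is unnecessary.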

Note that if you would rather ignore the date and sort by version number only, use the following approach:

from bs4 import BeautifulSoup
import requests
import re

base_url = "https://dl.twrp.me"
req = requests.get(f"{base_url}/gauguin")
soup = BeautifulSoup(req.content, "html.parser")

urls = []

for a in soup.find_all('a', href=True):
    link = a['href']
    
    if link.endswith('.img.html'):
        version = [int(n) for n in re.findall(r'\d+', link)]   # integers, so that 15 sorts after 5
        urls.append([version, link])

latest = sorted(urls, reverse=True)[0][1]       # choose the latest url
