Finding a specific date from a webpage?

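OK, so I've been working on this problem for a few days and have tried several approaches, but I believe my current implementation is the closest I've come. I want to retrieve the last "Update:" date from the following URL: https://steamcommunity.com/sharedfiles/filedetails/changelog/2016338122

I can't guarantee the link will always be the same: the number at the end will change, and I'll be looping over multiple pages to retrieve the same date from each.

Here's what I have so far: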

#!/usr/bin/env bash

## this is just a list of the mods i want to check.
activeModList=($(echo "$mods" | tr ',' '\n'))

for mod in "${activeModList[@]}"
do
   :
   modDirectory="modHTML/$mod.html"
   steamLink="https://steamcommunity.com/sharedfiles/filedetails/changelog/$mod"
   wget -O $modDirectory $steamLink
done

for mod in "${activeModList[@]}"
do
    :

    modDirectory="modHTML/$mod"
    modHTML="xmllint --nowarning --html --xpath "/html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1]" $modDirectory.html"

    lastUpdateTime=$(awk '/Update: /{p=1}p' "$modHTML")
    echo "$mod last updated: $lastUpdateTime"
done

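To make things clearer, $activeModList contains an array of mod numbers to iterate over. At the moment the script saves each HTML file to a specific folder.

I then try to use xmllint and awk to parse the date out of the web page.

It's worth noting that when I run the xmllint command, I get: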

modHTML/928102085.html:294: HTML parser error : Unexpected end tag : b
re you sure you want to revert changes to your Workshop item back to <b>%1$s</b>
                                                                               ^
modHTML/928102085.html:426: HTML parser error : htmlParseEntityRef: no name
s item has been removed from the community because it violates Steam Community &
                                                                               ^
<div class="changelog headline">&#13;
                                                        Update: 15 Aug, 2021 @ 5:10am

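I can't guarantee that I won't get warnings/errors like these every time, since I may be iterating over hundreds of similar pages, so I'd like to know whether I can parse the xmllint output to retrieve only the last update date and time.

Thanks a lot, everyone.

Edit:

The output for lastUpdateTime yields these errors: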

awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/928102085.html' for reading (No such file or directory)
928102085 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/731604991.html' for reading (No such file or directory)
731604991 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/1404697612.html' for reading (No such file or directory)
1404697612 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/618916953.html' for reading (No such file or directory)
618916953 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/566885854.html' for reading (No such file or directory)
566885854 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/924933745.html' for reading (No such file or directory)
924933745 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/1609138312.html' for reading (No such file or directory)

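A few issues with the current code:

  • awk expects a file as input by default, but modHTML is a (string) variable; to have awk process a variable you can use a here-string, which feeds the string to awk as if it were a file, e.g.: awk '/Update: /{p=1}p' <<< "$modHTML"
  • modHTML="xmllint --nowarning ..." assigns the string xmllint --nowarning ... to modHTML, whereas what you really want is to run the xmllint call and store its result in the modHTML variable, e.g., modHTML=$(xmllint --nowarning ...)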
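To make the two points concrete, here is a minimal sketch that needs neither xmllint nor a downloaded page. The variable names asText, asOutput and matched are made up for illustration, and printf stands in for the real xmllint call:

```bash
#!/usr/bin/env bash

# Assigning a quoted string stores the command text itself; nothing is run.
asText="printf 'Update: 15 Aug, 2021 @ 5:10am\n'"

# Command substitution runs the command and stores what it prints.
asOutput=$(printf 'Update: 15 Aug, 2021 @ 5:10am\n')

# A here-string passes the variable's contents to awk on standard input,
# so awk no longer tries to open the value as a file name.
matched=$(awk '/Update: /' <<< "$asOutput")

echo "asText:   $asText"
echo "asOutput: $asOutput"
echo "matched:  $matched"
```

The first echo shows the literal command text, while the other two show the date line, which is the behaviour the loop needs.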

Rolling these changes into OP's current code:

for mod in "${activeModList[@]}"
do
    modDirectory="modHTML/$mod"

    modHTML=$(xmllint --nowarning --html --xpath "/html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1]" "$modDirectory.html")

    lastUpdateTime=$(awk '/Update: /{p=1}p' <<< "$modHTML")

    # uncomment following line to assist with debugging; this will
    # show you exactly what's stored in the variables thus allowing
    # you to verify if your code is doing what you think it's doing

    # typeset -p modHTML lastUpdateTime

    echo "$mod last updated: $lastUpdateTime"
done

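Notes:

  • I don't use xmllint so I can't comment on whether this is a valid invocation, but the suggested code changes should at least get OP closer to the desired result
  • the awk call can probably be tweaked to give a more compact answer, but I'll leave that to OP (once we've got past the syntax errors and are generating actual output)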
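As one possible way to tighten the awk call, the sketch below prints only the date and time from the first "Update: " line. It assumes the xmllint output looks like the two-line sample in the question (reproduced here with a here-document so the example runs without xmllint) and that the first match is the entry OP wants:

```bash
#!/usr/bin/env bash

# Sample text standing in for the xmllint output (copied from the question).
modHTML=$(cat <<'EOF'
<div class="changelog headline">&#13;
                                                        Update: 15 Aug, 2021 @ 5:10am
EOF
)

# On the first line containing "Update: ", strip everything up to and
# including that label, trim trailing whitespace, print, and stop.
lastUpdateTime=$(awk '/Update: /{
    sub(/^.*Update: */, "")
    sub(/[[:space:]]+$/, "")
    print
    exit
}' <<< "$modHTML")

echo "last updated: $lastUpdateTime"
```

With the sample above this prints "last updated: 15 Aug, 2021 @ 5:10am". As for the HTML parser errors, xmllint normally writes those to stderr, so they should not end up in a variable filled via $(...); appending 2>/dev/null to the xmllint call would likely hide them from the terminal as well, though I have not tested that against these pages.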