Finding a specific date from a webpage?
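OK, so I've been working on this for a couple of days now. I've tried several approaches, and I believe my current implementation is the closest I've come. I want to retrieve the last "Update:" date from the following URL: https://steamcommunity.com/sharedfiles/filedetails/changelog/2016338122
I can't guarantee the link will stay the same: the number at the end changes, and I will be looping over many pages to retrieve the same date from each.
Here is what I have so far: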
#!/usr/bin/env bash
## this is just a list of the mods i want to check.
activeModList=($(echo "$mods" | tr ',' '\n'))
for mod in "${activeModList[@]}"
do
:
modDirectory="modHTML/$mod.html"
steamLink="https://steamcommunity.com/sharedfiles/filedetails/changelog/$mod"
wget -O $modDirectory $steamLink
done
for mod in "${activeModList[@]}"
do
:
modDirectory="modHTML/$mod"
modHTML="xmllint --nowarning --html --xpath "/html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1]" $modDirectory.html"
lastUpdateTime=$(awk '/Update: /{p=1}p' "$modHTML")
echo "$mod last updated: $lastUpdateTime"
done
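To make things clearer, $activeModList contains an array of mod numbers to iterate over.
Currently it saves the HTML files to a specific folder.
Then I try to use xmllint and awk to parse the date out of the page.
It's worth noting that when I invoke the xmllint command I get: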
modHTML/928102085.html:294: HTML parser error : Unexpected end tag : b
re you sure you want to revert changes to your Workshop item back to <b>%1$s</b>
^
modHTML/928102085.html:426: HTML parser error : htmlParseEntityRef: no name
s item has been removed from the community because it violates Steam Community &
^
<div class="changelog headline">
Update: 15 Aug, 2021 @ 5:10am
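Now, I can't guarantee I won't get warnings/errors like these every time, since I may be iterating over hundreds of similar pages, so I'd like to know whether I can parse xmllint's output to retrieve only the last update date and time.
Thanks a lot, everyone.
Edit:
The lastUpdateTime assignment produces these syntax errors: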
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/928102085.html' for reading (No such file or directory)
928102085 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/731604991.html' for reading (No such file or directory)
731604991 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/1404697612.html' for reading (No such file or directory)
1404697612 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/618916953.html' for reading (No such file or directory)
618916953 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/566885854.html' for reading (No such file or directory)
566885854 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/924933745.html' for reading (No such file or directory)
924933745 last updated:
awk: fatal: cannot open file `xmllint --nowarning --html --xpath /html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1] modHTML/1609138312.html' for reading (No such file or directory)
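A couple of issues with the current code:
- By default awk expects a file as input, but modHTML is a (string) variable; to have awk process a variable you can use a here-string to simulate feeding the string to awk as a file, e.g.: awk '/Update: /{p=1}p' <<< "$modHTML"
- modHTML="xmllint --nowarning ..." assigns the string xmllint --nowarning ... to modHTML, whereas what you really want is to run the xmllint call and store the result in the modHTML variable, e.g., modHTML=$(xmllint --nowarning ...)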
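To see the two points in isolation, here is a minimal sketch that uses printf as a stand-in for the xmllint call, so it runs without any HTML files. The variable names asText and asOutput are made up for this illustration and are not part of OP's script.

```shell
#!/usr/bin/env bash

# Plain assignment: the variable holds the command text itself,
# nothing is executed.
asText="printf 'Update: today\n'"

# Command substitution: the command runs and its stdout is stored.
asOutput=$(printf 'Update: today\n')

# A here-string feeds the variable's contents to awk on stdin,
# so awk does not try to open a file named after the string.
awk '/Update: /' <<< "$asOutput"

# Show exactly what each variable contains.
typeset -p asText asOutput
```

The typeset -p line makes the difference visible: the first variable contains the command line as text, the second contains what the command printed.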
Rolling these changes into OP's current code:
for mod in "${activeModList[@]}"
do
modDirectory="modHTML/$mod"
modHTML=$(xmllint --nowarning --html --xpath "/html/body/div[1]/div[7]/div[4]/div[1]/div[4]/div[11]/div[1]/div[2]/div[1]" "$modDirectory.html")
lastUpdateTime=$(awk '/Update: /{p=1}p' <<< "$modHTML")
# uncomment following line to assist with debugging; this will
# show you exactly what's stored in the variables thus allowing
# you to verify if your code is doing what you think it's doing
# typeset -p modHTML lastUpdateTime
echo "$mod last updated: $lastUpdateTime"
done
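Notes:
- I don't use xmllint so I can't comment on whether that is a valid invocation, but the suggested changes should at least get OP closer to the desired result
- The awk call could probably be tweaked to give a more compact answer, but I'll leave that to OP (once we get past the syntax errors and start producing actual output)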
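As one possible way to tighten the awk call, the sketch below prints only the first "Update:" line with the label stripped, then stops reading. It is an illustration under assumptions, not a tested solution for the Steam pages: the helper name extract_update is hypothetical, and the sample text is made up to resemble the xmllint output shown in the question.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the original post): read text on stdin,
# print the first "Update: ..." line without the label, then stop.
extract_update() {
    awk '/Update: / {
        sub(/^.*Update: */, "")   # drop everything up to and including the label
        sub(/[ \t\r]+$/, "")      # trim trailing whitespace
        print
        exit                      # only the first (most recent) entry is wanted
    }'
}

# Made-up sample resembling the xmllint output from the question.
sample='<div class="changelog headline">
        Update: 15 Aug, 2021 @ 5:10am
</div>'

extract_update <<< "$sample"   # prints: 15 Aug, 2021 @ 5:10am
```

Inside the loop this could replace the awk line, e.g. lastUpdateTime=$(extract_update <<< "$modHTML"). On the parser messages: as far as I know xmllint writes them to stderr, and command substitution captures only stdout, so they should not end up in modHTML; appending 2>/dev/null to the xmllint call should keep them off the terminal as well.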