How can I programmatically retrieve my SO reputation and badge counts from the command line?
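Original question
My first attempt was to run curl https://whosebug.com/users/5825294/enlico and pipe the result to sed/awk. However, as I've frequently read, sed and awk are not the best tools to parse HTML code. Furthermore, the above URL changes if I change my user name.
Oh, and here is my quick attempt with sed, written on multiple lines for readability: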
curl https://whosebug.com/users/5825294/enlico 2> /dev/null | sed -nE '
  /title="reputation"/,/bronze badges/{
    /"reputation"/{
      N
      N
      s!.*>(.*)</.*!\1!p
    }
    /badges/s/.*[^1-9]([1-9]+[0-9]*,*[0-9]* (gold|silver|bronze) badges).*/\1/p
  }'
which prints
10,968
5 gold badges
27 silver badges
56 bronze badges
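Clearly this script relies heavily on the peculiar structure of this specific HTML page. The most notable example is that I run N twice, because I've verified that the reputation sits two lines below the first line in the file that contains "reputation".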
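Update based on the answers
One answer almost answers my question. The missing bit is that I have 5 gold, 27 silver and 56 bronze badges, not 5, 18 and 7.
In this respect, I noticed that 18 is the number of silver badges I have if I don't count the ones awarded multiple times, so I played with jq and found that I can query award_count alongside rank, which I thought I could use to account for badges awarded multiple times. This kind of works, in the sense that running the following (fetch_user_badges is from Léa Gris's answer) produces the correct number of silver badges but the wrong number of bronze badges: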
$ fetch_user_badges Whosebug 5825294 | jq -r '
.items
| map({rank: .rank, count: .award_count})
| group_by(.rank)
| map([[.[0].rank],map(.count) | add])'
[
[
"bronze",
22
],
[
"gold",
5
],
[
"silver",
27
]
]
Does anyone know why that is?
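A possible explanation, not confirmed anywhere in this thread: the StackExchange API returns results in pages, and 5 + 18 + 7 = 30 matches its default page size, so the bronze total of 22 may only cover the bronze badges that fit on the first page. The sketch below works under that assumption. It requires curl and jq; fetch_all_badges and sum_awards_by_rank are hypothetical helper names (fetch_user_badges itself is not shown in this thread), and the pagesize, page and has_more names are taken from the API's paging conventions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every page of a user's badges, one JSON reply per page.
# Assumes the API's paging parameters (pagesize, page) and its has_more field.
fetch_all_badges() {
  local site=$1 uid=$2 page=1 reply
  while :; do
    # --compressed lets curl decode the compressed API reply
    reply=$(curl --silent --compressed \
      "https://api.stackexchange.com/2.2/users/${uid}/badges?site=${site}&pagesize=100&page=${page}") || return
    printf '%s\n' "$reply"
    # Stop once the API reports that no further page exists
    [ "$(jq -r '.has_more' <<<"$reply")" = true ] || break
    page=$((page + 1))
  done
}

# Hypothetical helper: read all pages from stdin and print "<rank> <total awards>".
sum_awards_by_rank() {
  jq -rs '
    map(.items[])
    | group_by(.rank)
    | map("\(.[0].rank) \(map(.award_count) | add)")
    | .[]
  '
}
```

Usage would look like fetch_all_badges Whosebug 5825294 | sum_awards_by_rank. The -s option makes jq slurp all pages into one array before grouping, so the totals span every page rather than only the first.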
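There are several ways to do this; I personally prefer XPath together with a tool like xidel (although you could also use xmlstarlet and others).
You can get your reputation score with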
xidel https://whosebug.com/users/5825294/enlico -e "//div[@title='reputation']/div/div[@class='grid--cell fs-title fc-dark']/text()"
Similarly, the number of gold badges can be obtained with:
xidel https://whosebug.com/users/5825294/enlico -e "//div[@class='grid ai-center s-badge s-badge__gold']//span[@class='grid grid__center fl1']/text()"
Changing the string gold to silver or bronze in the second XPath expression gives you the other two categories.
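The substitution described above can be scripted. A minimal sketch, assuming xidel is installed and the page still uses the class names from the two commands above; badge_xpath and print_badge_counts are hypothetical helper names, not part of xidel:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the badge XPath for one rank (gold, silver or bronze).
badge_xpath() {
  printf "//div[@class='grid ai-center s-badge s-badge__%s']//span[@class='grid grid__center fl1']/text()" "$1"
}

# Hypothetical helper: print one "<rank>: <count>" line per rank for a profile URL.
print_badge_counts() {
  local url=$1 rank
  for rank in gold silver bronze; do
    printf '%s: %s\n' "$rank" "$(xidel -s "$url" -e "$(badge_xpath "$rank")")"
  done
}
```

Calling print_badge_counts https://whosebug.com/users/5825294/enlico downloads the page once per rank, which is simple but wasteful; saving the page to a file first and pointing xidel at that file would avoid the repeated requests.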
A complete example that uses the StackExchange API and parses the response with jq.
#!/usr/bin/env bash
# This script fetches and prints some user info
# from a stack site using the StackExchange API

# Change this to the Whosebug's numerical user ID
STACK_UID=5825294
STACK_SITE='Whosebug'
STACK_API='https://api.stackexchange.com/2.2'
API_CACHE=~/.cache/stack_api
mkdir -p "$API_CACHE"

# Get a stack-site user using the StackExchange API and cache the result
# @Params:
# $1: the website (example Whosebug)
# $2: the numerical user ID
# @Output:
# &1: API JSON reply
stack_api::user() {
  stack_site=$1
  stack_uid=$2
  cache_file="${API_CACHE}/${stack_site}-users-${stack_uid}.json"
  yesterday_ref="${API_CACHE}/yesterday.ref"
  # Reference file whose timestamp is 24 hours in the past (GNU touch)
  touch -d yesterday "$yesterday_ref"
  # Expire cache
  [ "$cache_file" -ot "$yesterday_ref" ] && rm -f -- "$cache_file"
  # Call stack API only if no cached answer
  [ -f "$cache_file" ] || curl \
    --silent \
    --output "$cache_file" \
    --request GET \
    --url "${STACK_API}/users/${stack_uid}?site=${stack_site}"
  # Return cached answer (the API reply is stored compressed)
  zcat --force -- "$cache_file" 2>/dev/null
}

IFS=$'\n' read -r -d '' username reputation bronze silver gold < <(
  # Fetch user from a stack site
  stack_api::user "$STACK_SITE" "$STACK_UID" |
  # Parse the stack_api user data from the JSON response
  jq -r '
    .items[0] |
    .display_name,
    .reputation,
    ( .badge_counts |
      .bronze,
      .silver,
      .gold
    )
  '
)

printf 'Badges from UserID %d %s on the %s website:\n\n' \
  "$STACK_UID" "$username" "$STACK_SITE"
printf 'Reputation: %6d\n' "$reputation"
printf 'Bronze: %6d\n' "$bronze"
printf 'Silver: %6d\n' "$silver"
printf 'Gold: %6d\n' "$gold"
Sample output:
Badges from UserID 5825294 Enlico on the Whosebug website:
Reputation: 11144
Bronze: 56
Silver: 27
Gold: 5
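The read line in the script fills five variables from five lines of jq output. A small offline sketch of that pattern, with a hypothetical sample_fields function standing in for the curl and jq pipeline, and with values copied from the sample output above:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the jq output: one value per line.
sample_fields() {
  printf '%s\n' 'Enlico' '11144' '56' '27' '5'
}

# -d '' makes read consume the whole input, and IFS=$'\n' splits it on newlines.
# read returns non-zero here because it reaches end of input before finding a
# NUL delimiter, so "|| true" keeps the line safe under "set -e".
IFS=$'\n' read -r -d '' username reputation bronze silver gold < <(sample_fields) || true

printf '%s: rep=%d bronze=%d silver=%d gold=%d\n' \
  "$username" "$reputation" "$bronze" "$silver" "$gold"
```

The order of the variable names has to match the order in which the jq filter emits the fields, which is why bronze comes before silver and gold.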
as I've frequently read, sed and awk are not the best tools to parse HTML code.
That's right. Rather than repeating what others have already said, have a look at:
- Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
- RegEx match open tags except XHTML self-contained tags
- How do I extract data from an HTML or XML file?
Too bad the last one is dated, because for parsing HTML source I would pick the Swiss-army-knife tool xidel any time!
HTML source
$ xidel -s "https://whosebug.com/users/5825294" -e '
normalize-space(//div[@class="flex--item md:fl-auto"][1]),
//div[@class="d-flex ai-center mb12"]/normalize-space(div[@class="flex--item fl1"])
'
14,999 reputation
5 gold badges
31 silver badges
68 bronze badges
Furthermore, the above URL changes if I change my user name.
As you can see, "https://whosebug.com/users/5825294" works as well. With curl, -L, --location would be needed to follow the redirect to "https://whosebug.com/users/5825294/enlico". xidel does this automatically.
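For readers who stay with curl, a minimal sketch of fetching the profile by numeric ID alone, so that a user-name change does not matter. profile_url and fetch_profile are hypothetical helper names, and the URL pattern is the one quoted above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the name-independent profile URL from a numeric user ID.
profile_url() {
  printf 'https://whosebug.com/users/%s' "$1"
}

# Hypothetical helper: download the profile page, following the redirect
# to the URL that includes the current user name.
fetch_profile() {
  curl --silent --location "$(profile_url "$1")"
}
```

The output of fetch_profile 5825294 can then be piped into whichever HTML parser you prefer.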
StackExchange API
The same Swiss-army-knife tool is also a JSON parser:
$ xidel -s "https://api.stackexchange.com/2.2/users/5825294?site=Whosebug" -e '
$json/(items)()/(
reputation||" reputation",
for $x in reverse((badge_counts)()) return
join(((badge_counts)($x),$x,"badges"))
)
'
14999 reputation
5 gold badges
31 silver badges
68 bronze badges
See also this Xidel online tester for (alternative) intermediate steps.
The wisdom of the ancients is "do not parse HTML with regex", so how about
curl https://whosebug.com/users/5825294/enlico -s | php -r '$d=new DOMDocument();@$d->loadHTML(stream_get_contents(STDIN));$xp=new DOMXPath($d);foreach($xp->query("//*[@id=\"user-card\"]//*[contains(@title,\"badges\")]") as $foo){echo $foo->getAttribute("title"),PHP_EOL;}echo preg_replace("/\s+/"," ",$xp->query("//*[@title=\"reputation\"]")->item(0)->textContent);'
5 gold badges
27 silver badges
56 bronze badges
11,144 reputation
...