How can I retrieve programmatically from command line my SO rep and number of badges?

Question

Original question

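My first attempt was to run curl https://stackoverflow.com/users/5825294/enlico and pipe the result into sed/awk. However, as I've frequently read, sed and awk are not the best tools to parse HTML code. Furthermore, the above URL changes if I change my user name.

Oh, and here's my quick attempt with sed, written on multiple lines for readability: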

curl https://stackoverflow.com/users/5825294/enlico 2> /dev/null | sed -nE '
/title="reputation"/,/bronze badges/{
    /"reputation"/{
        N
        N
        s!.*>(.*)</.*!\1!p
    }
/badges/s/.*[^1-9]([1-9]+[0-9]*,*[0-9]* (gold|silver|bronze) badges).*/\1/p
}'

which prints

10,968
5 gold badges
27 silver badges
56 bronze badges

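Clearly this script relies heavily on the peculiar structure of this specific HTML page. The most notable example is that I run N twice, because I've verified that the reputation sits two lines below the first line in the file that contains "reputation".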

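Update based on the answers

The answer almost answers my question. The missing bit is that I have 5 gold, 27 silver, and 56 bronze badges, not 5, 18, 7.

In this respect, I've noticed that 18 is the number of silver badges I have if I don't count those awarded multiple times. So I played with jq and found that I can query award_count next to rank, which I think I can use to account for badges awarded multiple times. This kind of works, in the sense that running the following (fetch_user_badges comes from Léa Gris's answer) produces the correct number of silver badges but the wrong number of bronze badges: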

$ fetch_user_badges stackoverflow 5825294 | jq -r '
.items
| map({rank: .rank, count: .award_count})
| group_by(.rank)
| map([[.[0].rank],map(.count) | add])'

[
  [
    "bronze",
    22
  ],
  [
    "gold",
    5
  ],
  [
    "silver",
    27
  ]
]

Does anybody know why this is?
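One plausible explanation, offered here as an assumption rather than something the thread confirms: the API returns results in pages of 30 items by default, and 5 + 18 + 7 = 30, so the counts above would come from the first page only. The sketch below requests pages of 100 items until has_more is false and sums award_count per rank. The function names fetch_all_badges and sum_badges_by_rank are hypothetical; page, pagesize, and has_more are the API's paging fields.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every page of a user's badges, one JSON
# document per page. Assumes curl and jq are installed.
fetch_all_badges() {
  local site=$1 uid=$2 page=1 reply
  while :; do
    # --compressed lets curl decode the gzip-compressed API reply
    reply=$(curl --silent --compressed \
      "https://api.stackexchange.com/2.2/users/${uid}/badges?site=${site}&pagesize=100&page=${page}") || return 1
    printf '%s\n' "$reply"
    # Stop as soon as the API no longer reports a further page
    [ "$(printf '%s' "$reply" | jq -r '.has_more')" = true ] || break
    page=$((page + 1))
  done
}

# Hypothetical helper: read one or more pages on stdin and print
# "<rank> <total>" lines, summing award_count per rank.
sum_badges_by_rank() {
  jq -rs '
    map(.items[])
    | group_by(.rank)
    | map("\(.[0].rank) \(map(.award_count) | add)")
    | .[]'
}

# Usage: fetch_all_badges stackoverflow 5825294 | sum_badges_by_rank
```

The aggregation is kept separate from the download so that it can be tried on saved replies without calling the API again.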

Answer 1

There are several ways to do this; personally I prefer XPath together with a tool like xidel (though you could also use xmlstarlet, etc.).

You can get your reputation score with

xidel https://stackoverflow.com/users/5825294/enlico  -e "//div[@title='reputation']/div/div[@class='grid--cell fs-title fc-dark']/text()"

Similarly, the number of gold badges can be obtained with:

xidel https://stackoverflow.com/users/5825294/enlico  -e "//div[@class='grid ai-center s-badge s-badge__gold']//span[@class='grid grid__center fl1']/text()"

Changing the string gold to silver or bronze in the second XPath expression gives you the other two categories.
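The substitution described above can be wrapped in a loop. This is a minimal sketch, assuming xidel is installed and that the page still uses the class names quoted in this answer (the site's markup changes over time, so they may no longer match). badge_xpath and print_badge_counts are hypothetical helper names.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the XPath expression for one badge rank
badge_xpath() {
  printf "//div[@class='grid ai-center s-badge s-badge__%s']//span[@class='grid grid__center fl1']/text()" "$1"
}

# Hypothetical helper: print one "<rank>: <count>" line per rank
# for the given profile URL
print_badge_counts() {
  local url=$1 rank
  for rank in gold silver bronze; do
    printf '%s: %s\n' "$rank" "$(xidel -s "$url" -e "$(badge_xpath "$rank")")"
  done
}

# Usage: print_badge_counts https://stackoverflow.com/users/5825294/enlico
```

This downloads the page once per rank, which is wasteful; it is only meant to show the gold/silver/bronze substitution.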

Answer 2

A complete example that uses the StackExchange API and parses the response with jq.

#!/usr/bin/env bash

# This script fetches and prints some user info
# from a stack-site using the stackexchange's API

# Change this to your Stack Overflow numerical user ID

STACK_UID=5825294
STACK_SITE='stackoverflow'
STACK_API='https://api.stackexchange.com/2.2'

API_CACHE=~/.cache/stack_api

mkdir -p "$API_CACHE"

# Get a stack-site user using the stackexchange API and cache the result
# @Params:
# $1: the website (example stackoverflow)
# $2: the numerical user ID
# @Output:
# &1: API Json reply
stack_api::user() {
  stack_site=$1
  stack_uid=$2

  cache_file="${API_CACHE}/${stack_site}-users-${stack_uid}.json"

  yesterday_ref="${API_CACHE}/yesterday.ref"
  touch -d yesterday "$yesterday_ref"

  # Expire cache
  [ "$cache_file" -ot "$yesterday_ref" ] && rm -f -- "$cache_file"

  # Call stack API only if no cached answer
  [ -f "$cache_file" ] || curl \
    --silent \
    --output "$cache_file" \
    --request GET \
    --url "${STACK_API}/users/${stack_uid}?site=${stack_site}"

  # Return cached answer
  zcat --force -- "$cache_file" 2>/dev/null
}

IFS=$'\n' read -r -d '' username reputation bronze silver gold < <(
  # Fetch user from a stack site
  stack_api::user "$STACK_SITE" "$STACK_UID" |

  # Parse the stack_api user data from the JSON response
  jq -r '
.items[0] |
  .display_name,
  .reputation,
  ( .badge_counts |
    .bronze,
    .silver,
    .gold
  )
  '
)

printf 'Badges from UserID %d %s on the %s website:\n\n' \
  $STACK_UID "$username" "$STACK_SITE"
printf 'Reputation: %6d\n' "$reputation"
printf 'Bronze:     %6d\n' "$bronze"
printf 'Silver:     %6d\n' "$silver"
printf 'Gold:       %6d\n' "$gold"

Sample output:

Badges from UserID 5825294 Enlico on the stackoverflow website:

Reputation:  11144
Bronze:         56
Silver:         27
Gold:            5
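The caching step of the script can be looked at in isolation. The sketch below reproduces its two ideas on any file: compare the cache file against a reference file dated yesterday, and read it back with zcat --force, which decompresses gzip data and passes plain text through unchanged. cache_is_stale and read_cache are hypothetical names, and touch -d and zcat --force assume the GNU tools, as the script itself does.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed when the file is missing or more than
# one day old. Relies on GNU touch -d, like the script above.
cache_is_stale() {
  local file=$1 ref stale=1
  [ -f "$file" ] || return 0
  ref=$(mktemp)
  touch -d yesterday "$ref"
  # -ot is true when the cache file is older than the reference file
  [ "$file" -ot "$ref" ] && stale=0
  rm -f -- "$ref"
  return "$stale"
}

# Hypothetical helper: print a cached reply whether or not it is
# gzip-compressed
read_cache() {
  zcat --force -- "$1" 2>/dev/null
}

# Usage: cache_is_stale "$cache_file" && rm -f -- "$cache_file"
```

An alternative, if a plain-text cache is preferred, would be to pass --compressed to curl so that the reply is decoded before it is written to disk.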

Answer 3

as I've frequently read, sed and awk are not the best tools to parse HTML code.

That's right. Instead of repeating what others have already said, have a look at:

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
RegEx match open tags except XHTML self-contained tags
How do I extract data from an HTML or XML file?

Too bad the last one is rather outdated, because for parsing HTML source I'd pick the Swiss Army knife tool xidel any day!

HTML source

$ xidel -s "https://stackoverflow.com/users/5825294" -e '
  normalize-space(//div[@class="flex--item md:fl-auto"][1]),
  //div[@class="d-flex ai-center mb12"]/normalize-space(div[@class="flex--item fl1"])
'
14,999 reputation
5 gold badges
31 silver badges
68 bronze badges

Furthermore, the above URL changes if I change my user name.

As you can see, "https://stackoverflow.com/users/5825294" works as well.
With curl, -L, --location would be needed to follow the redirect to "https://stackoverflow.com/users/5825294/enlico". xidel does this automatically.
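To illustrate the redirect, here is a minimal sketch with curl. profile_url and resolve_profile_url are hypothetical helper names; --location follows the redirect and --write-out '%{url_effective}' prints the URL curl finally landed on.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the short profile URL from a numerical user ID
profile_url() {
  printf 'https://stackoverflow.com/users/%s' "$1"
}

# Hypothetical helper: follow redirects and print the final URL,
# discarding the page body
resolve_profile_url() {
  curl --silent --location --output /dev/null \
    --write-out '%{url_effective}\n' "$(profile_url "$1")"
}

# Usage: resolve_profile_url 5825294
```

Because only the numerical ID goes into the request, a later change of user name should not break the lookup.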

StackExchange API

The same Swiss Army knife tool is also a JSON parser:

$ xidel -s "https://api.stackexchange.com/2.2/users/5825294?site=stackoverflow" -e '
  $json/(items)()/(
    reputation||" reputation",
    for $x in reverse((badge_counts)()) return
    join(((badge_counts)($x),$x,"badges"))
  )
'
14999 reputation
5 gold badges
31 silver badges
68 bronze badges

See also this Xidel online tester (alternative) for the intermediate steps.

Answer 4

The wisdom of the ancients is: do not parse HTML with regex. So how about

curl https://stackoverflow.com/users/5825294/enlico -s | php -r '$d=new DOMDocument();@$d->loadHTML(stream_get_contents(STDIN));$xp=new DOMXPath($d);foreach($xp->query("//*[@id=\"user-card\"]//*[contains(@title,\"badges\")]") as $foo){echo $foo->getAttribute("title"),PHP_EOL;}echo preg_replace("/\s+/"," ",$xp->query("//*[@title=\"reputation\"]")->item(0)->textContent);'

5 gold badges
27 silver badges
56 bronze badges
 11,144 reputation
