Which block represents a WARC-Block-Digest?

Line 09 of the record below reads: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ

Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
Line 16: Connection: Keep-Alive

The WARC specification says: "The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record."

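I have been trying to work out what "the full block of the record" refers to. Is it lines 11 to 16? Lines 12 to 16? Or lines 1 to 16 without line 9? I tried hashing each of these candidates but could not reproduce the SHA-1 (base32) value above.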

A WARC record of an HTTP GET request consists of three parts (see the WARC spec):

  1. The WARC header
  2. The HTTP request header
  3. An empty payload (note: a POST request would carry a non-empty payload)

The payload digest of this record is therefore the base32-encoded SHA-1 of the empty string. Proof using Linux command-line tools:

$> echo -n "" | openssl dgst -binary -sha1 | base32
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
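
The same check can be reproduced without the command-line tools. This is a minimal Python sketch using only the standard library; the helper name `sha1_base32` is an illustrative choice, not part of any WARC tooling:

```python
import base64
import hashlib


def sha1_base32(data: bytes) -> str:
    """Return the SHA-1 digest of data, base32-encoded as in WARC digest headers."""
    return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


# An HTTP GET request carries no body, so its payload is the empty byte string.
print(sha1_base32(b""))  # 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```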

The format of a WARC record is:

warc-record  = header CRLF
               block CRLF CRLF

(see WARC spec: record model)
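
To make the grammar concrete, here is a rough Python sketch of splitting one uncompressed record into its header fields and its block. It assumes the record is already in memory as bytes, that header fields are not folded across lines, and that Content-Length is present. `split_record` is a hypothetical helper, not an API of any WARC library.

```python
from typing import Dict, Tuple


def split_record(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split one WARC record into (header fields, block bytes)."""
    # The header ends at the first empty line, i.e. the first CRLF CRLF.
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/block separator found")
    lines = head.decode("utf-8").split("\r\n")
    fields: Dict[str, str] = {}
    for line in lines[1:]:  # lines[0] is the version line, e.g. WARC/1.0
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    # Content-Length gives the exact size of the block in bytes.
    length = int(fields["content-length"])
    block = rest[:length]
    if len(block) != length:
        raise ValueError("record is truncated")
    return fields, block


# Toy record (not the one from the question) to show the split.
record = b"WARC/1.0\r\nWARC-Type: request\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
fields, block = split_record(record)
print(fields["warc-type"], block)
```

Slicing by Content-Length rather than searching for the closing CRLF CRLF matters because the block itself may contain or end with empty lines.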

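The "full" block includes everything before the record's trailing \r\n\r\n. That means lines 12 to 17: line 11 is the CRLF that terminates the WARC header, and line 17 is the empty line after line 16, which the listing above does not show. Note that the HTTP GET request itself also ends with \r\n\r\n (a trailing empty line); those bytes belong to the block and bring it to the 141 bytes given in Content-Length: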

$> cat request 
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive

$> tail -n2 request | hexdump -C
00000000  43 6f 6e 6e 65 63 74 69  6f 6e 3a 20 4b 65 65 70  |Connection: Keep|
00000010  2d 41 6c 69 76 65 0d 0a  0d 0a                    |-Alive....|
0000001a
$> cat request | openssl dgst -binary -sha1 | base32
CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
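
The same verification can be sketched in Python. The request bytes below are typed out by hand from the record with CRLF line endings, which is an assumption about the archived bytes. The length check against Content-Length: 141 is a quick way to catch a wrong line ending or a missing trailing empty line before comparing digests. `block_digest` is an illustrative helper name.

```python
import base64
import hashlib

# The HTTP request lines of the record, each terminated by CRLF,
# followed by the empty line that ends the request.
block = (
    b"GET /vital-signs/carbon-dioxide/ HTTP/1.1\r\n"
    b"User-Agent: Wget/1.15 (linux-gnu)\r\n"
    b"Accept: */*\r\n"
    b"Host: climate.nasa.gov\r\n"
    b"Connection: Keep-Alive\r\n"
    b"\r\n"
)

# Value copied from the WARC-Block-Digest header of the record.
expected = "sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ"


def block_digest(data: bytes) -> str:
    """Format a SHA-1 digest as 'sha1:' followed by its base32 encoding."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


print(len(block))  # 141, matching Content-Length
print(block_digest(block) == expected)
```

Dropping the final b"\r\n" or switching to bare LF line endings changes the length (139 or 135 bytes), which is the likely reason the hashes tried in the question did not match.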