How to read a phrase from a URL, and what kind of stream does SWI-Prolog's `http_open/3` supply?

Question

I'm using SWI-Prolog's library(http/http_open). According to the docs, "After [http_open(Url, Stream, [])] succeeds the data can be read from Stream." So I thought I could build a simple declarative predicate for parsing a phrase from a URL by using phrase_from_stream/2 from library(pure_input):

phrase_from_url(Url, Phrase) :-
    http_open(Url, In, []),
    phrase_from_stream(Phrase, In),
    close(In).

But I suspect there is some subtlety about the kind of stream that http_open/3 supplies; I get the following error:

ERROR: set_stream_position/2: stream `<stream>(0x7feebbf5c810)' does not exist (Device not configured)

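(I have tested the same URL with the example given in the library(http/http_open) documentation, which uses copy_stream_data/2 to pipe the output to user_output, and that works. So I know there is nothing wrong with the URL.)

I understand that I could download the data from the URL into a string, a list of codes, or a text file, and then use phrase/n, the cousin of phrase_from_stream/2, on that. But I was hoping someone could help inform me...

...on using a DCG for this
...and perhaps give some insight into why we can't use phrase_from_stream/2 on some streams, as one might naively hope.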

Answer 1

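At the moment, library(pure_input) does not support non-repositionable streams. That is the problem here: phrase_from_stream/2 relies on set_stream_position/2, which is the call that fails in your error message.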
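To see whether a given stream can be repositioned at all, one can query its reposition property. The following is a minimal sketch; can_reposition/1 and url_repositionable/2 are hypothetical helper names, not part of any library, and the result for a particular URL is something to try rather than something this sketch promises.

```prolog
:- use_module(library(http/http_open)).

% can_reposition(+Stream): hypothetical helper, true if Stream reports
% that set_stream_position/2 can be used on it.
can_reposition(Stream) :-
    stream_property(Stream, reposition(true)).

% url_repositionable(+Url, -Bool): hypothetical helper that opens Url
% and reports the flag as true or false. The stream is always closed.
url_repositionable(Url, Bool) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        (   can_reposition(In)
        ->  Bool = true
        ;   Bool = false
        ),
        close(In)).
```

Calling url_repositionable/2 on the URL from the question is a quick way to confirm that the stream, not the grammar, is what phrase_from_stream/2 objects to.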

One solution is to read everything first and then use plain phrase/2 on the result. This is of course not the same as the promised "lazy reading".
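A rough sketch of that eager approach, assuming the whole response fits comfortably in memory. It keeps the argument order of the predicate in the question; phrase_from_whole_stream/2 and the toy grammar starts_with_hello//0 are hypothetical names added only for illustration.

```prolog
:- use_module(library(http/http_open)).
:- use_module(library(readutil)).

:- meta_predicate
    phrase_from_whole_stream(//, +),
    phrase_from_url(+, //).

% Read everything from In eagerly, then run the grammar on the code list.
phrase_from_whole_stream(Phrase, In) :-
    read_stream_to_codes(In, Codes),
    phrase(Phrase, Codes).

% Same argument order as in the question. The stream is closed as soon
% as the data has been read, before the grammar runs.
phrase_from_url(Url, Phrase) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        read_stream_to_codes(In, Codes),
        close(In)),
    phrase(Phrase, Codes).

% Hypothetical toy grammar, only for trying the predicates out.
starts_with_hello --> "hello", anything.

anything --> [].
anything --> [_], anything.
```

Reading inside setup_call_cleanup/3 and parsing outside it means the connection is released even if the grammar fails or leaves choice points.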

As for "parsing data from URL", keep in mind that SWI-Prolog has libraries for many of the things you find on the web: SGML/XML/HTML; JSON; RDF.

To pick text out of an HTML page, see this simple scraper. The relevant code is in scrape/3 and its helper predicates. It uses the SWI-Prolog SGML/XML parser and library(xpath).
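This is not the scraper linked above, only a small sketch of the same idea: parse the page with load_html/3 and pick out nodes with xpath/3. The predicate names dom_links/2 and url_links/2 are hypothetical, and collecting href attributes is just one example of something to extract.

```prolog
:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
:- use_module(library(xpath)).

% dom_links(+DOM, -Links): hypothetical helper collecting the href
% attribute of every <a> element found anywhere in DOM.
dom_links(DOM, Links) :-
    findall(Href, xpath(DOM, //a(@href), Href), Links).

% url_links(+Url, -Links): hypothetical helper that fetches Url, parses
% it as HTML and returns the links. The stream is always closed.
url_links(Url, Links) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        load_html(stream(In), DOM, []),
        close(In)),
    dom_links(DOM, Links).
```

The SGML parser only reads forward, so it has no trouble with the stream that http_open/3 supplies.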

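In the meantime, if you want to use a DCG to parse from a non-repositionable stream, you are out of luck. library(pure_input) does not even work on standard input. Depending on the structure of your data, what you can do is use read_line_to_codes/3 (see the example) if your input is organized line-wise, or read_pending_input/3 if it is not, and read into a buffer.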
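For line-oriented input, the idea can be sketched roughly as follows. For brevity this uses read_line_to_codes/2 rather than the difference-list version read_line_to_codes/3, and lines_parsed/3 is a hypothetical name. It assumes every line must match the grammar; a line that does not match makes the whole call fail.

```prolog
:- use_module(library(readutil)).
:- use_module(library(dcg/basics)).   % only for integer//1 in the usage example

:- meta_predicate lines_parsed(3, +, -).

% lines_parsed(:NonTerminal, +In, -Results): hypothetical helper.
% Reads In line by line and parses each line with NonTerminal//1,
% collecting one result per line. Works on any input stream because
% it never tries to reposition it.
lines_parsed(NT, In, Results) :-
    read_line_to_codes(In, Line),
    (   Line == end_of_file
    ->  Results = []
    ;   phrase(call(NT, Result), Line),
        Results = [Result|Rest],
        lines_parsed(NT, In, Rest)
    ).
```

With a stream In obtained from http_open/3, a call such as lines_parsed(integer, In, Ns) would parse one integer per line using integer//1 from library(dcg/basics).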

Answer 2

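As Boris points out, non-repositionable streams cannot be used with library(pure_input). read_stream_to_codes/2 followed by phrase/2 gives you a practical way to test your grammar against real data.

However, 'real world' HTML is very hard to parse (even with the support of the built-in SGML parser) because of poor error handling. So debugging a DCG can be a nightmare, even on well-behaved syntax.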

parsing

http

prolog

swi-prolog

dcg