Occasional "Premature end of file" error while running RSS Input in Kettle?

Question

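In Pentaho Kettle, I configured an RSS Input step with several URLs. When I run the transformation, it works perfectly most of the time, but occasionally it fails with the following error: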

2016/06/29 13:10:48 - RSS Input.0 - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : Unexpected Exception : it.sauronsoftware.feed4j.FeedXMLParseException: org.dom4j.DocumentException: Error on line -1 of document  : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : it.sauronsoftware.feed4j.FeedXMLParseException: org.dom4j.DocumentException: Error on line -1 of document  : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 -     at it.sauronsoftware.feed4j.FeedParser.parse(FeedParser.java:53)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.steps.rssinput.RssInput.readNextUrl(RssInput.java:168)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.steps.rssinput.RssInput.getOneRow(RssInput.java:198)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.steps.rssinput.RssInput.processRow(RssInput.java:312)
2016/06/29 13:10:48 - RSS Input.0 -     at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
2016/06/29 13:10:48 - RSS Input.0 -     at java.lang.Thread.run(Thread.java:745)
2016/06/29 13:10:48 - RSS Input.0 - Caused by: org.dom4j.DocumentException: Error on line -1 of document  : Premature end of file. Nested exception: Premature end of file.
2016/06/29 13:10:48 - RSS Input.0 -     at org.dom4j.io.SAXReader.read(SAXReader.java:482)
2016/06/29 13:10:48 - RSS Input.0 -     at org.dom4j.io.SAXReader.read(SAXReader.java:291)
2016/06/29 13:10:48 - RSS Input.0 -     at it.sauronsoftware.feed4j.FeedParser.parse(FeedParser.java:37)
2016/06/29 13:10:48 - RSS Input.0 -     ... 5 more

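I am using the default RSS Input step that ships with Kettle; screenshot below:

The links I configured in the RSS Input step are:

How can I fix this? Even when I run the step with only one of the feed links, it occasionally shows the same error. Is there a problem with this plugin?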

Answer 1

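The main problem is www.ft.com.

For some reason, the site's server sometimes drops the connection partway through the response. A Python implementation, by contrast, was able to read all the data from the HTTP stream and parse it successfully.

It looks to me like the code that builds the RSS response on the site has some bugs.

Kettle uses feed4j to parse RSS. The feed4j library uses a plain HttpConnection to open the stream and fetch the data.

I wrote some simple code that reads the HttpConnection I/O stream and ran into the same behavior: the web server occasionally drops the connection.

Requesting the same resource with Apache HttpClient works fine: no errors, and all the data is received from the server.

My guess is that http://ft.com requires a well-formed HTTP request, most likely with some specific, correctly formed headers.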
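
The header guess above could be checked with a small JDK-only sketch like the following. It sets explicit User-Agent and Accept headers on an HttpURLConnection and reads the body until end of stream. The class name FeedFetcher, the method names, the header values and the timeouts are illustrative assumptions, not anything taken from feed4j or Kettle, and whether these particular headers satisfy ft.com is untested.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of feed4j or Kettle.
class FeedFetcher {

    // Reads the stream until EOF, so it does not depend on a Content-Length header.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Fetches the feed with explicit request headers (the values are assumptions).
    static byte[] fetch(String feedUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(feedUrl).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; feed-check)");
        conn.setRequestProperty("Accept", "application/rss+xml, application/xml;q=0.9, */*;q=0.8");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(30000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling FeedFetcher.fetch(...) repeatedly against one of the failing feed URLs and comparing the byte counts would show whether the server still cuts the response short once the headers are present.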

Answer 2

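If you really need to, you can patch the source by hand.

Just get the feed4j source code. It is very old, so there is only one version.

Open the file it.sauronsoftware.feed4j.FeedParser.java in an editor.

It has only one method, parse: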

public static Feed parse(URL url){
    SAXReader saxReader = new SAXReader();
    Document document = saxReader.read(url);
    ...

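The good news is that SAXReader has several overloaded read methods, and one of them is the one you need:

   saxReader.read(InputStream in)

Instead of passing the URL to the read method, write code that reads the data from the URL using HttpClient (it is bundled with Kettle/PDI; to check the exact version, look at $KETTLE-HOME/lib/commons-httpclient-x.x.jar).

Then wrap the data that HttpClient received from the server in a ByteArrayInputStream and pass it to the SAXReader.

Build the library and replace feed4j-1.0.jar with your build.

That's it.

The code would look something like this: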

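// Needs the HttpClient 4.x classes (org.apache.http.*) plus java.io.DataInputStream,
// java.io.ByteArrayInputStream and java.io.InputStream on top of feed4j's own imports.
public static Feed parse(URL url){
    SAXReader saxReader = new SAXReader();
    // Fetch the feed with Apache HttpClient instead of letting SAXReader open the URL
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet get = new HttpGet(url.toString());
    CloseableHttpResponse response = client.execute(get);
    HttpEntity entity = response.getEntity();
    // Relies on Content-Length; getContentLength() is negative when the length is unknown
    byte[] b = new byte[(int)entity.getContentLength()];
    // readFully blocks until the buffer is full; a single read() may return fewer bytes
    new DataInputStream(entity.getContent()).readFully(b);
    InputStream is = new ByteArrayInputStream(b);

    Document document = saxReader.read(is);
    ...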

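Additional details

You may need to add code that wraps a possible IOException in a FeedXMLParseException.
This code assumes the server sends a Content-Length header in the response.
Use a matching JDK version.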
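
The "buffer first, then parse" idea can also be tried without touching feed4j. The sketch below is only an illustration under stated assumptions: it swaps dom4j's SAXReader for the JDK's built-in DocumentBuilder so that it runs with no extra jars, and the class name BufferedFeedParser, the method name and the error messages are hypothetical. It takes a response body that has already been read completely, rejects an empty body with an explicit message, and reports XML problems as an IOException with the cause attached.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper; uses the JDK parser in place of dom4j's SAXReader.
class BufferedFeedParser {

    // Parses a fully buffered response body.
    static Document parseBuffered(byte[] body) throws IOException {
        // Fail early with a clear message instead of handing an empty stream to the parser.
        if (body == null || body.length == 0) {
            throw new IOException("Empty response body; nothing to parse");
        }
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(body));
        } catch (ParserConfigurationException | SAXException e) {
            // Keep the original exception as the cause so the parser message is not lost.
            throw new IOException("Response is not well-formed XML", e);
        }
    }
}
```

Because the body is held in memory before parsing, a dropped connection shows up as a read failure or a short body at fetch time rather than as a parser error deep inside the RSS Input step.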
