What causes an "Error in the web application" error when using Watson's Document Conversion service?

Question

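I am trying to use Watson's Document Conversion service to convert some PDF files into answer units. The files are all compressed into one large .zip file, which is uploaded to my Bluemix server running a Node.js application. The application unzips the files in memory and tries to send each one in turn to the conversion service: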

var document_conversion = watson.document_conversion(dcCredentials);

function createCollection(res, solrClient, docs)
   {
   for (var doc in docs) //docs is an array of objects describing the pdf files
      {
      console.log("Converting: %s", docs[doc].filename);

      //make a stream of this pdf file
      var rs = new Readable;    //create the stream
      rs.push(docs[doc].data);  //add pdf file (string object) to stream
      rs.push(null);        //end of stream marker

      document_conversion.convert(
         {
         file: rs,
         conversion_target: "ANSWER_UNITS"
         }, 
         function (err, response) 
            {
            if (err) 
               {
               console.log("Error converting doc: ", err);
        .
        .
        .
        etc...

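Every time, the conversion service returns a 400 error with the description "Error in the web application".

After two days of racking my brain over the cause of this unhelpful error message, I am fairly sure the problem is that the conversion service cannot determine the type of the file being sent, because no filename is associated with it. That is only a guess, of course, but I cannot test the theory, because I don't know how to give the service that information without actually writing each file to disk and reading it back.

Can anyone help?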

Answer 1

The code below iterates over a zip file and converts each document to ANSWER_UNITS.
It uses node-unzip-2, and the zip file documents.zip contains these 3 sample files.

var unzip  = require('node-unzip-2');
var watson = require('watson-developer-cloud');
var fs     = require('fs');

var document_conversion = watson.document_conversion({
  username: 'USERNAME',
  password: 'PASSWORD',
  version_date: '2015-12-01',
  version:  'v1'
});

function convertDocument(doc) {
  document_conversion.convert({
    file: doc,
    conversion_target: document_conversion.conversion_target.ANSWER_UNITS,
  }, function (err, response) {
    if (err) {
      console.error(doc.path,'error:',err);
    } else {
      console.log(doc.path,'OK');
      // hide the results for now
      //console.log(JSON.stringify(response, null, 2));
    }
  });
}

fs.createReadStream('documents.zip')
  .pipe(unzip.Parse())
  .on('entry', function (entry) {
    if (entry.type === "File") {
      convertDocument(entry);
    } else {
      // Prevent out of memory issues calling autodrain for non processed entries
      entry.autodrain();
    }
  });

Sample output:

$ node app.js
 sampleHTML.html OK
 sampleWORD.docx OK
 samplePDF.pdf OK

Answer 2

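Update: the problem lies in how the underlying form-data library handles streams: it doesn't calculate the length of streams (apart from file and request streams, for which it has extra logic).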

getLengthSync() method DOESN'T calculate length for streams, use knownLength options as workaround.

I found two ways to work around it. Calculate the length yourself and pass it as an option:

document_conversion.convert({
  file: { value: rs, options: { knownLength: 12345 } }
  ...

Or use a Buffer:

document_conversion.convert({
  file: { value: myBuffer, options: {} }
  ...
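
Both workarounds can be sketched for a file that is already held in memory, as in the question. This is a minimal sketch, not the SDK's prescribed usage: the helper names (toBuffer, asStreamParam, asBufferParam) and the sample document are made up for illustration, and it assumes the file data is either a Buffer or a binary string. The filename option is included on the assumption that it is forwarded to the multipart part; whether the service relies on it is not confirmed here.

```javascript
'use strict';
const { Readable } = require('stream');

// Hypothetical helper: make sure the in-memory file data is a Buffer,
// so that its size in bytes is known exactly.
// Assumes data is a Buffer or a binary ("latin1") string.
function toBuffer(data) {
  return Buffer.isBuffer(data) ? data : Buffer.from(data, 'binary');
}

// Workaround 1: keep sending a stream, but state its byte length up front.
function asStreamParam(doc) {
  const buf = toBuffer(doc.data);
  const rs = new Readable({ read() {} }); // no-op read; data is pushed below
  rs.push(buf);
  rs.push(null); // end-of-stream marker
  return {
    value: rs,
    options: { filename: doc.filename, knownLength: buf.length }
  };
}

// Workaround 2: skip the stream and pass the Buffer itself.
function asBufferParam(doc) {
  return { value: toBuffer(doc.data), options: { filename: doc.filename } };
}

// Made-up in-memory document, shaped like the entries in the question.
const doc = { filename: 'sample.pdf', data: '%PDF-1.4\n\u00e9' };
console.log(asStreamParam(doc).options);
console.log(asBufferParam(doc).value.length);
```

Either return value would then be passed as the file parameter of document_conversion.convert. Note that the length is taken from the Buffer rather than from the string, since a string's character count need not match its byte count.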

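The reason you received a 400 response is that the Content-Length header of your request was calculated incorrectly: the declared length was too small, so the MIME part of the request was truncated (rather than closed).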

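I suspect this happens because the Readable stream does not provide a length or size for your content when the request library calculates the size of the entity.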
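
To make the undercount concrete, here is a simplified illustration of a length calculation that can only measure values whose size is known up front. It is not form-data's actual code; knownByteLength is a hypothetical function written for this example, and the payload is a stand-in for real PDF bytes.

```javascript
'use strict';
const { Readable } = require('stream');

// Simplified illustration only; this is NOT form-data's implementation.
// Returns the number of bytes that can be counted for a part's value
// without consuming it.
function knownByteLength(value, options = {}) {
  if (typeof options.knownLength === 'number') return options.knownLength;
  if (Buffer.isBuffer(value)) return value.length;
  if (typeof value === 'string') return Buffer.byteLength(value);
  return 0; // generic stream: nothing to measure ahead of time
}

const payload = Buffer.from('fake pdf bytes'); // stand-in for file contents
const rs = new Readable({ read() {} });
rs.push(payload);
rs.push(null);

console.log(knownByteLength(rs));                                   // plain stream
console.log(knownByteLength(rs, { knownLength: payload.length })); // with hint
console.log(knownByteLength(payload));                             // Buffer
```

In this sketch the plain stream contributes nothing to the total, so a Content-Length built from it would be smaller than the body that is actually sent, which matches the truncation described above.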

Also, we apologize for the unhelpful error message. We will do better.
