How to skip invalid characters when reading an Excel file

Reading some Excel files with Apache POI fails with an error like this:

Caused by: org.xml.sax.SAXParseException; systemId: file://; lineNumber: 105; columnNumber: 147342; An invalid XML character (Unicode: 0xffff) was found in the element content of the document.
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)

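Opening xl/sharedStrings.xml shows that it contains <ffff> (the character U+FFFF), which causes this problem.

How can I read the file successfully while ignoring these invalid characters? For example: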

aaa <ffff> bbb ==> aaa bbb

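Those invalid characters should not be in the XML, and Excel itself would not put them there. So someone probably did something wrong when creating that file with something other than Excel. That error should be fixed at its source rather than by trying to ignore the symptoms.

But I know what it is like to depend on work that others will only finish in the distant future, if ever. So one needs to improvise. In this case that is only possible with ugly low-level methods. Because the XML is invalid, it cannot be parsed as XML. So the only option is to treat it as a string and replace the offending characters.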

I have shown this approach before. In that case the task was to replace UTF-16 surrogate-pair numeric character references, which are also invalid in XML.
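To make the surrogate-pair case concrete, here is a minimal sketch of what such a replacement step might look like. It is not the code from the earlier answer: the class name `SurrogatePairReferences`, the regular expression, and the decision to emit a single decimal reference are all assumptions made for illustration. It looks for two adjacent numeric character references (decimal or hexadecimal) that form a high/low surrogate pair and rewrites them as one reference to the combined code point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SurrogatePairReferences {

    // Two adjacent numeric character references, decimal (&#55357;) or hexadecimal (&#xD83D;).
    private static final Pattern PAIR = Pattern.compile(
        "&#(x[0-9A-Fa-f]{1,6}|[0-9]{1,7});&#(x[0-9A-Fa-f]{1,6}|[0-9]{1,7});");

    // Parses the digits of a reference; a leading 'x' marks hexadecimal.
    private static int parse(String digits) {
        return digits.startsWith("x")
            ? Integer.parseInt(digits.substring(1), 16)
            : Integer.parseInt(digits);
    }

    // Hypothetical helper: merges each surrogate-pair reference into one code point reference.
    static String replaceUTF16SurrogatePairs(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        Matcher m = PAIR.matcher(xml);
        int pos = 0;
        while (m.find(pos)) {
            int high = parse(m.group(1));
            int low = parse(m.group(2));
            if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF) {
                // Copy the text before the pair, then emit the combined code point.
                out.append(xml, pos, m.start());
                int codePoint = Character.toCodePoint((char) high, (char) low);
                out.append("&#").append(codePoint).append(';');
                pos = m.end();
            } else {
                // Not a surrogate pair: keep the first reference and retry from the second,
                // because the second one may be the high half of a real pair.
                int secondStart = m.start(2) - 2;
                out.append(xml, pos, secondStart);
                pos = secondStart;
            }
        }
        out.append(xml, pos, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceUTF16SurrogatePairs("<t>smile &#55357;&#56832;</t>"));
    }
}
```

Unpaired surrogate references are left untouched by this sketch; they would still need to be removed or replaced by a separate step.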

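Below I show more flexible code that makes it possible to add further repair steps for /xl/sharedStrings.xml as needed.

The principle is to use OPCPackage, which represents the *.xlsx ZIP package, to get /xl/sharedStrings.xml out as a text string. Then do the necessary replacements and put the repaired /xl/sharedStrings.xml back into the OPCPackage. Finally, create the XSSFWorkbook from the repaired OPCPackage instead of from the corrupt file.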

import java.io.*;

import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.openxml4j.opc.*;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class RepairSharedStringsTable {
    
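 // Removes every character that XML 1.0 does not allow in a document.
 static String removeInvalidXmlCharacters(String string) {
  String xml10pattern = "[^"
                    + "\u0009\r\n"                 // tab, carriage return, line feed
                    + "\u0020-\uD7FF"              // BMP characters below the surrogate range
                    + "\uE000-\uFFFD"              // rest of the BMP, without U+FFFE and U+FFFF
                    + "\ud800\udc00-\udbff\udfff"  // U+10000 to U+10FFFF, written as surrogate pairs
                    + "]";
  return string.replaceAll(xml10pattern, "");
 }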
    
 // Reads /xl/sharedStrings.xml from the package as text, repairs it, and writes it back.
 // I/O errors are propagated so that a failed read never overwrites the part with an empty string.
 static void repairSharedStringsTable(OPCPackage opcPackage) throws IOException {
  for (PackagePart packagePart : opcPackage.getPartsByName(Pattern.compile("/xl/sharedStrings\\.xml"))) {

   // Read the whole part into a string; the bytes are assumed to be UTF-8.
   String sharedStrings;
   try (BufferedInputStream inputStream = new BufferedInputStream(packagePart.getInputStream());
        ByteArrayOutputStream sharedStringsBytes = new ByteArrayOutputStream() ) {
    byte[] buffer = new byte[1024];
    int length;
    while ((length = inputStream.read(buffer)) != -1) {
     sharedStringsBytes.write(buffer, 0, length);
    }
    sharedStrings = sharedStringsBytes.toString("UTF-8");
   }

   // Apply the repair steps; further steps can be added here.
   System.out.println(sharedStrings);
   //sharedStrings = replaceUTF16SurrogatePairs(sharedStrings);
   sharedStrings = removeInvalidXmlCharacters(sharedStrings);
   //sharedStrings = doSomethingElse(sharedStrings);
   System.out.println(sharedStrings);

   // Write the repaired text back into the package part.
   try (BufferedOutputStream outputStream = new BufferedOutputStream(packagePart.getOutputStream()) ) {
    outputStream.write(sharedStrings.getBytes("UTF-8"));
   }
  }
 }

 public static void main(String[] args) throws Exception {
  try (XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream("./Excel.xlsx"))) {
   System.out.println("success");
  } catch (Exception ex) {
   System.out.println("failed");
   ex.printStackTrace();
  }

  OPCPackage opcPackage = OPCPackage.open(new FileInputStream("./Excel.xlsx"));
  repairSharedStringsTable(opcPackage);
  opcPackage.flush();
  
  try (XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
       FileOutputStream out = new FileOutputStream("./ExcelRepaired.xlsx");) {
   workbook.write(out);
   System.out.println("success");
  } catch (Exception ex) {
   System.out.println("failed");
   ex.printStackTrace();
  }
 }
}
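The character class in removeInvalidXmlCharacters follows the Char production of XML 1.0. As a rough sketch, the same filter can also be written as a loop over code points, which some may find easier to read and to unit-test without POI. The names `XmlCharFilter` and `isValidXml10` are made up for this illustration, and the behavior is only intended to match the regex version above.

```java
class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Keeps only valid code points; unpaired surrogates fall into D800-DFFF and are dropped.
    static String removeInvalidXmlCharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
         .filter(XmlCharFilter::isValidXml10)
         .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String broken = "aaa" + (char) 0xFFFF + "bbb";
        System.out.println(removeInvalidXmlCharacters(broken));
    }
}
```

Note that only the invalid character itself is removed. Whitespace around it, as in the "aaa <ffff> bbb" example from the question, stays in place.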

In my case, all of the following files contain invalid characters:

xl/sharedStrings.xml
xl/worksheets/sheet1.xml
xl/worksheets/sheet8.xml

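All of these XML parts should be processed, so the part-name pattern has to cover the worksheets as well (note that the dots must be escaped as \\. inside a Java string literal):

opcPackage.getPartsByName(Pattern.compile("(/xl/sharedStrings\\.xml)|(/xl/worksheets/.+\\.xml)"))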
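Before running the repair on a real workbook, it can help to check which part names such a pattern selects. The sketch below uses only java.util.regex and assumes that getPartsByName compares the pattern against the whole part name; the class `PartNamePatternCheck` and the sample part names are illustrative.

```java
import java.util.regex.Pattern;

class PartNamePatternCheck {

    // Shared strings plus every worksheet part; dots are escaped so they match literally.
    static final Pattern XML_PARTS =
        Pattern.compile("(/xl/sharedStrings\\.xml)|(/xl/worksheets/.+\\.xml)");

    // True if the whole part name matches the pattern.
    static boolean selected(String partName) {
        return XML_PARTS.matcher(partName).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "/xl/sharedStrings.xml",
            "/xl/worksheets/sheet1.xml",
            "/xl/worksheets/sheet8.xml",
            "/xl/worksheets/_rels/sheet1.xml.rels",
            "/xl/styles.xml"
        };
        for (String name : names) {
            System.out.println(name + " -> " + selected(name));
        }
    }
}
```

With whole-name matching, relationship parts ending in .rels and unrelated parts such as /xl/styles.xml are not selected, so only the shared strings table and the worksheets go through the repair step.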