Regex splitting a string to an array using Java and Tika

Question

I'm trying to take Tika output (PDF to text) and split the result into an array of words or character groups.

I'm using something like this:

String str = contenthandler.toString();
// The backslash must be doubled inside a Java string literal
String[] splitArray = str.split("\\s+");

for (String word : splitArray) {
    System.out.println(word);
}

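But it isn't splitting where I expect, which is between words. I want to keep line breaks, page breaks, tabs, etc. and only remove white space. Sample text from Tika looks like this: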

"...or supplemented except by a written instrument signed by both parties.  The unenforceability of any provision on this Agreement shall not affect the enforceability of any other provision of this Agreement.  Neither this Agreement nor the disclosure of any Confidential Information pursuant to this Agreement by any party shall restrict such party from disclosing any of its Confidential Information to any third party...."

I've been playing with the regex at http://java-regex-tester.appspot.com/

A pattern like [^a-zA-Z] finds the spaces, whereas \s+ does not. How do I split on these guys?

Answer 1

Tabs and newlines are white space. If you only want to split on one or more space characters, you need to do:

String[] splitArray = str.split(" +");

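Edit

In response to the OP's comment: the spaces don't seem to match \s+. In that case, the characters between the words (the "spaces") are none of [" ", \t, \n, \x0B, \f, \r]*. You could try matching on \b (a word boundary). To really find out what the characters are, paste the string into a good text editor and view the raw characters (in Notepad++, for example, that is View -> Show All Characters). Note the hex code of the character between the words and check what it is.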
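The same inspection can be done directly in Java instead of a text editor. The following is a minimal sketch, not part of the original answer: the class name `SeparatorInspector` and the method `nonAlphanumericCodePoints` are hypothetical, and it assumes that anything which is not a letter or digit is a candidate separator worth looking at.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SeparatorInspector {
    // Returns the distinct code points in the text that are neither letters
    // nor digits, formatted as U+XXXX, in order of first appearance.
    static Set<String> nonAlphanumericCodePoints(String text) {
        Set<String> found = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> !Character.isLetterOrDigit(cp))
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a non-breaking space, a plain space, and a tab
        String sample = "written\u00A0instrument signed\tby";
        // prints [U+00A0, U+0020, U+0009]
        System.out.println(nonAlphanumericCodePoints(sample));
    }
}
```

Running this on the Tika output (for example on `contenthandler.toString()`) lists every separator-like character once, so an unexpected entry such as U+00A0 stands out immediately.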

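Edit after the OP's testing

By inspecting the hex representation of the text (via edithex.com), the OP determined that the space character is a non-breaking space (0xA0). So this code satisfies the requirement: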

String[] splitArray = str.split("\\xA0");

It seems PDFs often encode spaces as characters other than the standard space (ASCII code 0x20 = 32); in this case the character was the non-breaking space, 0xA0. This blog post implies that PDFs might not encode spaces as standard spaces. The various options for space characters that \s will not pick up are here.
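If the text may mix ordinary spaces with non-breaking or other exotic spaces, one option is to split on the Unicode "space separator" category rather than on a single character. This is a sketch under that assumption, not the original answer's code; the class name `SpaceSplitter` and the method `splitOnSpaces` are made up for illustration. `\p{Zs}` covers U+0020 and U+00A0 but not tabs or line breaks, which matches the goal of keeping those in the output.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SpaceSplitter {
    // \p{Zs} is the Unicode space-separator category. It includes the plain
    // space (U+0020) and the non-breaking space (U+00A0), but not tab,
    // line feed, or form feed, so those stay inside the tokens.
    private static final Pattern SPACES = Pattern.compile("\\p{Zs}+");

    static String[] splitOnSpaces(String text) {
        return SPACES.split(text);
    }

    public static void main(String[] args) {
        // Hypothetical sample mixing NBSP, plain spaces, a newline, and a tab
        String sample = "signed\u00A0by both\u00A0 parties.\nThe\tunenforceability";
        System.out.println(Arrays.toString(splitOnSpaces(sample)));
    }
}
```

With this sample the result has four tokens, the last of which still contains the newline and the tab. Note that text starting with a space yields an empty first token, as with any `split` call.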

*In the sample text they are plain spaces, but they must have been changed during the copy/paste.

java

regex

apache-tika