How to tokenize/parse/search & replace a document by font AND font style in LibreOffice Writer?

Question

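I need to update a bilingual dictionary written in Writer by first parsing every entry into its parts, e.g.

Main word (font 1, bold)
Foreign equivalent, transliterated (font 1, italic)
Foreign equivalent (font 2, bold)
Part of speech (font 1, italic)

Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.

I need to automate the process of going through the whole file line by line and placing a delimiter between each part, ignoring spaces and punctuation, so that I can bulk-import it into a Calc file. In other words, "each part" is a sequence of characters (ignoring spaces and punctuation) that share the same font AND font style.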

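I have tried the standard Find & Replace feature as well as the AltSearch extension, but neither can do the job. The main problem is that I cannot write a search query that says:

Find: consecutive characters with the same font AND font_style, ignoring spaces and punctuation

Replace: the term found above + "delimiter"

Any suggestions on how I could script this, or is there an existing tool that would solve the problem?

Thanks!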

Pseudocode of the desired effect:

var $delimiter = "|"

Go to beginning of document

While not end of document do:
     var $currLine = get line from doc
     var $currChar = get next character which is not space or punctuation
     var $font = $currChar.font
     var $font_style = $currChar.font_style   // e.g. bold, italic, normal

     While not end of line do:
          $currChar = next character which is not space or punctuation

          if ($currChar.font != $font || $currChar.font_style != $font_style) {   // font or style has changed
               print $delimiter

               $font = $currChar.font
               $font_style = $currChar.font_style
          }
     end While

end While

Answer 1

Here are hints for each thing the pseudocode does.

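First, the easiest way to move line by line is with the TextViewCursor, although it is slow. Notice the XLineCursor section. For the while loop, oVC.goDown(1, False) returns false when it reaches the end of the document (oVC is the variable holding the TextViewCursor).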
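A rough sketch of that line loop as a Python generator. It assumes the Writer view cursor exposes gotoStartOfLine and gotoEndOfLine (the XLineCursor interface) next to goDown and goRight, and it relies on the statement above that goDown returns false at the end of the document. The name iter_lines is made up for this example. Keep in mind that the view cursor moves over layout lines as displayed on screen, so a dictionary entry that wraps would be reported as two lines.

```python
def iter_lines(view_cursor):
    """Yield the text of each displayed line, starting at the cursor's current line.

    view_cursor is assumed to behave like a Writer TextViewCursor.
    """
    while True:
        view_cursor.gotoStartOfLine(False)  # move to the line start, no selection
        view_cursor.gotoEndOfLine(True)     # extend the selection to the line end
        yield view_cursor.getString()
        view_cursor.goRight(0, False)       # drop the selection again
        if not view_cursor.goDown(1, False):
            break                           # no further line below: stop
```

Inside a LibreOffice Python macro the cursor would typically come from XSCRIPTCONTEXT.getDocument().getCurrentController().getViewCursor(), after placing it at the top of the document.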

Get each character by calling oVC.goRight(0, False) to deselect, followed by oVC.goRight(1, True) to select. The selected value is then obtained with oVC.getString(). To ignore spaces and punctuation, perhaps use Python's isalnum() or the re module.
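One possible filter for the string returned by getString(), sketched with the standard unicodedata module instead of isalnum(). The reason for the swap is that isalnum() is false for combining marks, which can occur inside words in some scripts, whereas checking the Unicode category only rejects whitespace and punctuation. The helper name is_separator is hypothetical.

```python
import unicodedata

def is_separator(text):
    """Return True if text is empty or consists only of whitespace/punctuation."""
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in text
    )
```

Characters for which is_separator returns True would simply be skipped when comparing formatting.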

To determine the font of a character, call oVC.getPropertyValue(attr). The values of attr could simply be CharAutoStyleName and CharStyleName, to check for any change in formatting.

Or grab a list of specific properties such as 'CharFontFamily', 'CharFontFamilyAsian', 'CharFontFamilyComplex', 'CharFontPitch', 'CharFontPitchAsian', etc. Character properties are described at https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Formatting.
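A small sketch of turning those property lookups into a single comparable value. The property names CharFontName, CharWeight and CharPosture are my own pick from the character properties for "font plus bold/italic"; they are not the ones listed above, so adjust the tuple as needed. format_key is a hypothetical helper, and the getattr call is there on the assumption that enum-typed properties such as CharPosture come back as objects carrying a value attribute.

```python
# Assumed choice of properties: font name, bold-ness, italic-ness.
FORMAT_PROPERTIES = ("CharFontName", "CharWeight", "CharPosture")

def format_key(text_range, names=FORMAT_PROPERTIES):
    """Build a tuple describing the formatting of the current selection."""
    key = []
    for name in names:
        value = text_range.getPropertyValue(name)
        # Enum-like results are reduced to their plain value so tuples compare cleanly.
        key.append(getattr(value, "value", value))
    return tuple(key)
```

Two characters then belong to the same part when format_key gives equal tuples for both.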

To insert the delimiter into the text: oVC.getText().insertString(oVC, "|", 0).
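The decision of where a delimiter belongs can be tried out without LibreOffice at all. The sketch below follows the question's pseudocode on plain data: it takes one line as (character, format) pairs, however those were collected, and returns the parts as a list. split_fields and is_separator are names invented here, and trimming whitespace from each part is a choice made for the Calc import, not something the question requires.

```python
import unicodedata

def is_separator(ch):
    """Whitespace and punctuation never start a new part."""
    return ch.isspace() or unicodedata.category(ch).startswith("P")

def split_fields(chars):
    """Split one line, given as (character, format_key) pairs, where the format changes."""
    fields = []
    current = []
    current_fmt = None
    for ch, fmt in chars:
        if not is_separator(ch):
            if current_fmt is not None and fmt != current_fmt:
                # Font or style changed: close the part collected so far.
                fields.append("".join(current).strip())
                current = []
            current_fmt = fmt
        current.append(ch)
    if current:
        fields.append("".join(current).strip())
    return fields
```

Joining the result with "|".join(...) gives the delimited line. Punctuation stays attached to the part it follows, so trailing commas may need a second clean-up pass.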

This Python code from GitHub shows how to do most of these things, although you will need to read through it to find the relevant parts.

Alternatively, instead of using the LibreOffice API, unzip the .odt file and parse content.xml with a script.
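A minimal sketch of that route using only the standard library. It assumes the usual ODF layout, in which each paragraph is a text:p element and directly formatted runs are text:span children carrying a text:style-name attribute. paragraph_runs is a hypothetical name, and the sketch ignores special elements such as text:s (repeated spaces) and text:tab.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def paragraph_runs(odt_file):
    """Return, per paragraph, a list of (style_name, text) runs from content.xml."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        runs = []
        if para.text:
            runs.append((None, para.text))       # text before the first span
        for child in para:
            if child.tag == f"{{{TEXT_NS}}}span":
                style = child.get(f"{{{TEXT_NS}}}style-name")
                runs.append((style, "".join(child.itertext())))
            if child.tail:
                runs.append((None, child.tail))  # unstyled text after the element
        paragraphs.append(runs)
    return paragraphs
```

The style names are usually automatic ones such as T1 or T2. Their font, weight and posture are defined in the office:automatic-styles section of the same file, so runs whose styles resolve to the same formatting would still need to be merged before joining with the delimiter.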

Tags: scripting, parsing, tokenize, writer, libreoffice