NodeJS RTF ANSI Find and Replace Words With Special Chars

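I have a find and replace script that works fine when the words don't have any special characters. However, there will often be special characters, since it is finding names. As of now, this breaks the script.

The script looks for {<some-text>} and attempts to replace the contents (as well as remove the braces).

Example: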

text.rtf

Here's a name with special char {Kotouč}

script.ts

import * as fs from "fs";

// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf", "utf8");
console.log("content::\n", content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {

    // It correctly identifies the targeted text.
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    // Here I need a way to escape `plainText` string so that it matches the source.
    console.log("currMatch::", currMatch);
    console.log("currMatch === plainText::", currMatch === plainText);
    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS!");
        console.log("newContent:", newContent);
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}

currMatch:: {Kotou\uc0\u269 \}

currMatch === plainText:: false

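It looks like ANSI escaping. I tried using jsesc, but it produces a different string: {Kotou\u010D} instead of the string found in the document, {Kotou\uc0\u269 \}.

How can I dynamically escape the plainText string variable so that it matches what is in the document?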

What I needed was to deepen my knowledge of the rtf format, and of text encoding in general.

The raw RTF text read from the file gives us a few hints:

{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...

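This part of the rtf file metadata tells us a few things.

It is using RTF file format version 1. The encoding is ANSI, and specifically cpg1252, also known as Windows-1252 or CP-1252, which is: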

...a single-byte character encoding of the Latin alphabet

(source)

The valuable piece of information there is that it is using the Latin alphabet, which comes in handy later.
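The code page also explains the \'92 that appears in the second output further down: a \'hh escape is a single byte in the document's code page, written in hex. Here is a minimal sketch of decoding such escapes, assuming Windows-1252. The names cp1252Extras and decodeHexEscapes are made up for illustration, and the table is deliberately partial.

```typescript
// Hypothetical partial table: only a few Windows-1252 bytes in 0x80-0x9F
// are listed here; a complete decoder would need the full code page.
const cp1252Extras: { [byte: number]: string } = {
    0x91: "\u2018", // left single quotation mark
    0x92: "\u2019", // right single quotation mark
    0x93: "\u201C", // left double quotation mark
    0x94: "\u201D", // right double quotation mark
};

// Replaces RTF hex escapes such as \'92 with the character they stand for.
const decodeHexEscapes = (rtf: string): string =>
    rtf.replace(/\\'([0-9a-fA-F]{2})/g, (whole: string, hex: string) => {
        const byte: number = parseInt(hex, 16);
        if (byte in cp1252Extras) {
            return cp1252Extras[byte];
        }
        // Outside 0x80-0x9F, Windows-1252 agrees with Latin-1 / Unicode.
        if (byte < 0x80 || byte >= 0xa0) {
            return String.fromCharCode(byte);
        }
        // Unknown byte in this sketch: leave the escape untouched.
        return whole;
    });

console.log(decodeHexEscapes("Here\\'92s a name")); // Here’s a name
```

Characters that Windows-1252 cannot represent at all, such as č, cannot be written this way, which is why the file falls back to a unicode escape for them.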

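Knowing the specific version of RTF I was using, I stumbled upon the RTF 1.5 Spec.

A quick search of that spec for one of the escape sequences I was looking into revealed that it is an RTF-specific escape control sequence, namely \uc0. Knowing that, I was able to parse out what I was really after, \u269. At that point I knew it was unicode and had a good hunch that \u269 stands for unicode character code 269. So I looked that up...

The \u269 (char code 269) shows up on this page to confirm. Now I knew the character set and what needed to be done to get the equivalent plain text (unescaped), and I had a basic function to start from.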
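To make the \u269 lookup concrete, here is a small sketch of turning the decimal number from an RTF \uN control word into a character. The helper name rtfCodeToChar is hypothetical. It assumes N is a signed 16-bit decimal, which is how I read the spec, while jsesc prints the same code point in hexadecimal. That is why the two strings differ.

```typescript
// Converts the decimal N from an RTF \uN control word into a character.
// Assumption: N may be written as a signed 16-bit value, so negative
// numbers are shifted back into the 0-65535 range first.
const rtfCodeToChar = (n: number): string =>
    String.fromCharCode(n < 0 ? n + 65536 : n);

console.log(rtfCodeToChar(269)); // č
console.log((269).toString(16)); // 10d, i.e. the \u010D that jsesc printed
console.log(rtfCodeToChar(269) === "\u010D"); // true
```

As far as I can tell from the spec, the \uc0 in front only says that zero fallback ANSI characters follow the \u escape, so it carries no character data of its own.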

Using all of that knowledge, I was able to piece it together from there. Here is the full corrected script and its output:

script.ts

import * as fs from "fs";


// Match the RTF unicode control prefix, with or without a leading \uc0:
// http://www.biblioscape.com/rtf15_spec.htm
const unicodeControlReg: RegExp = /(?:\\uc0)?\\u/g;

// Extracts the unicode character from an escape sequence with handling for rtf.
// The backslashes must be escaped so they match literal `\` in the file text.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;

/**
 * Util function to strip junk characters from string for comparison.
 * @param {string} str
 * @returns {string}
 */
const cleanupRtfStr = (str: string): string => {
    return str
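        // Drop whitespace, including the space that terminates \uN.
        .replace(/\s/g, "")
        // Drop the backslashes that escape the braces in RTF.
        .replace(/\\/g, "");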
};

/**
 * Detects escaped unicode and looks up the character by that code.
 * @param {string} str
 * @returns {string}
 */
const unescapeString = (str: string): string => {
    const unescaped = str.replace(matchEscapedChars, (cc: string) => {
        const stripped: string = cc.replace(unicodeControlReg, "");
        const charCode: number = Number(stripped);

        // See unicode character codes here:
        //  https://unicodelookup.com/#latin/11
        return String.fromCharCode(charCode);
    });

    // Whitespace is left in place here; cleanupRtfStr removes it afterwards.
    return unescaped;
};

// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf", "binary");
console.log("content::\n", content);

// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";

// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
    const currMatch: string = matches[i];
    const isRtfMetadata: boolean = currMatch.endsWith(";}");
    if (isRtfMetadata) {
        continue;
    }

    if (currMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS!");
        console.log("\n\nnewContent:", newContent);
        break;
    }

    const unescapedMatch: string = unescapeString(currMatch);
    const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
    if (cleanedMatch === plainText) {
        const newContent: string = content.replace(currMatch, "IT_WORKS_UNESCAPED!");
        console.log("\n\nnewContent:", newContent);
        break;
    }
}

Output

content::
 {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}


newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}
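Note that the output above still has a stray backslash in front of IT_WORKS_UNESCAPED!, because the match starts at the { and not at the \ that escapes it. One way around that is to go in the other direction and escape plainText into the form the file uses, then replace that literal string. This is only a sketch: escapeForRtf is a hypothetical helper, and it assumes the writer emits \uc0\uN followed by a space for every non-ASCII character, as in the sample above. Other RTF writers may format this differently.

```typescript
// Escapes plain text the way the sample file above appears to be written.
// Assumption: every non-ASCII character is emitted as \uc0\uN followed by
// a space, and braces and backslashes are prefixed with a backslash.
const escapeForRtf = (text: string): string => {
    let out: string = "";
    for (let i: number = 0; i < text.length; i++) {
        const ch: string = text[i];
        const code: number = text.charCodeAt(i);
        if (ch === "{" || ch === "}" || ch === "\\") {
            out += "\\" + ch;
        } else if (code > 127) {
            // \uN takes a signed 16-bit decimal, hence the wrap above 32767.
            out += "\\uc0\\u" + (code > 32767 ? code - 65536 : code) + " ";
        } else {
            out += ch;
        }
    }
    return out;
};

// "\u010D" is č, written as an escape so the source encoding cannot matter.
const escaped: string = escapeForRtf("{Kotou\u010D}");
console.log(escaped); // \{Kotou\uc0\u269 \}

const sample: string = "special char \\{Kotou\\uc0\\u269 \\}}";
// split/join replaces a literal substring without regex or `$` surprises.
console.log(sample.split(escaped).join("IT_WORKS!")); // special char IT_WORKS!}
```

Because the escaped string includes the leading \{ and trailing \}, the replacement removes the braces together with their escaping backslashes.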

Hope this helps those who aren't familiar with character encoding/escaping and how it is used in rtf formatted documents!