Perform multiple Regex filters on text content in Node.js with Javascript

Question

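I have multiple regex filters that I want to run on a .txt file in Node. I read the file and assign its contents to a variable, and then I want to parse the contents with regex to remove any illegal characters.

I initially tried one of the only Node modules I could find for this, https://www.npmjs.com/package/clean-text-utils, but it seems to be aimed at Typescript and I couldn't get it to work with Node 8.10. So I dug into the node_module to find the relevant JS and tried to use that function to replace the illegal characters.

How can I run all of the regex filters on the myTXT variable? Currently it just outputs the text with the incorrect non-ASCII apostrophes.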

var myTXT;

...

const readFile = util.promisify(fs.readFile);
await readFile('/tmp/' + myfile, 'utf8')
    .then((text) => {
        console.log('Output contents: ', text);
        myTXT = text;
    })
    .catch((err) => {
        console.log('Error', err);
    });

var myTXT = function (myTXT) {
    var s = text
        .replace(/[‘’\u2018\u2019\u201A]/g, '\'')
        .replace(/[“”\u201C\u201D\u201E]/g, '"')
        .replace(/\u2026/g, '...')
        .replace(/[\u2013\u2014]/g, '-');
    return s.trim();
};

console.log('ReplaceSmartChars is', myTXT);

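Here is an example of the apostrophe problem, which is caused by copying text from a web page and pasting it into a .txt file. It is also available on PasteBin: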

Resilience is what happens when we’re able to move forward even when things don’t fit together the way we expect.

And tolerances are an engineer’s measurement of how well the parts meet spec. (The word ‘precision’ comes to mind). A 2018 Lexus is better than 1968 Camaro because every single part in the car fits together dramatically better. The tolerances are more narrow now.

https://pastebin.com/uJ7GAKk4

Copied from the following URL, pasted into Notepad, and saved:

https://seths.blog/storyoftheweek/

Answer 1

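Currently you are not calling the function that performs the replacement; instead you are overwriting that function with your text.

const fs = require('fs');
const util = require('util');

const readFile = util.promisify(fs.readFile);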

function replaceChars(text) {
   return text
        .replace(/[‘’\u2018\u2019\u201A]/g, '\'')
        .replace(/[“”\u201C\u201D\u201E]/g, '"')
        .replace(/\u2026/g, '...')
        .replace(/[\u2013\u2014]/g, '-')
        .trim();
}

// On Node 8.10, `await` must run inside an async function.
const myTXT = await readFile('/tmp/' + myfile, 'utf8')
    .then((text) => {
        console.log('Output contents: ', text);
        return replaceChars(text);
    })
    .catch((err) => {
        console.log('Error', err);
    });

console.log('ReplaceSmartChars is', myTXT);
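
If the list of filters is likely to grow, one option is to keep the patterns in a table and apply them in order. The following is only a minimal sketch of that idea: `FILTERS` and `applyFilters` are hypothetical names that do not come from the question or from clean-text-utils, and the sample string is made up for illustration.

```javascript
// Hypothetical filter table: each entry is [pattern, replacement].
const FILTERS = [
  [/[\u2018\u2019\u201A]/g, "'"], // curly single quotes -> '
  [/[\u201C\u201D\u201E]/g, '"'], // curly double quotes -> "
  [/\u2026/g, '...'],             // horizontal ellipsis -> three dots
  [/[\u2013\u2014]/g, '-'],       // en dash and em dash -> hyphen
];

// Apply every filter in order, feeding each result into the next one,
// then trim surrounding whitespace.
function applyFilters(text, filters) {
  return filters
    .reduce((acc, [pattern, replacement]) => acc.replace(pattern, replacement), text)
    .trim();
}

// Made-up sample input containing smart punctuation.
const sample = 'we\u2019re able \u2026 \u201Cprecision\u201D \u2013 ok ';
console.log(applyFilters(sample, FILTERS));
```

Adding another filter then only means appending one more `[pattern, replacement]` pair to the table.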

Answer 2

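You should put the console call inside the async lambda, and rename the myTXT function to something different from the myTXT variable.

Try the code below.

const fs = require('fs');
const util = require('util');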

var myTXT;

(async () => {
    const readFile = util.promisify(fs.readFile);
    await readFile('/tmp/' + myfile, 'utf8')
      .then((text) => {
          console.log('Output contents: ', text);
          myTXT = text;
      })
      .catch((err) => {
          console.log('Error', err);
      });

    var replace = function (text) {
      var s = text
          .replace(/[‘’\u2018\u2019\u201A]/g, '\'')
          .replace(/[“”\u201C\u201D\u201E]/g, '"')
          .replace(/\u2026/g, '...')
          .replace(/[\u2013\u2014]/g, '-');
      return s.trim();
    };

    console.log('ReplaceSmartChars is', replace(myTXT));
})()
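
One thing to watch in this pattern is that a failed read leaves the text undefined, so the replacement call would then throw. Below is a hedged sketch of the same flow using try/catch, so the replacement only runs when the read succeeded. `replaceSmartChars`, `readAndClean`, and the demo file name are hypothetical names chosen for illustration, and the demo writes its own temporary file so the snippet can run on its own.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const util = require('util');

const readFile = util.promisify(fs.readFile);

// Same replacements as in the answers above, written with escapes only.
function replaceSmartChars(text) {
  return text
    .replace(/[\u2018\u2019\u201A]/g, "'")
    .replace(/[\u201C\u201D\u201E]/g, '"')
    .replace(/\u2026/g, '...')
    .replace(/[\u2013\u2014]/g, '-')
    .trim();
}

// Read the file and clean it; resolve to null if the read fails,
// so the caller never passes undefined to replaceSmartChars.
async function readAndClean(filePath) {
  try {
    const text = await readFile(filePath, 'utf8');
    return replaceSmartChars(text);
  } catch (err) {
    console.log('Error', err.message);
    return null;
  }
}

// Demo: write a small temporary file (hypothetical name), then clean it.
const demoPath = path.join(os.tmpdir(), 'smart-chars-demo.txt');
fs.writeFileSync(demoPath, 'things don\u2019t fit \u2014 yet\n', 'utf8');

readAndClean(demoPath).then((cleaned) => {
  console.log('ReplaceSmartChars is', cleaned);
});
```

Returning null on failure is just one possible design choice; rethrowing the error and letting the caller handle it would work as well.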

Answer 3

I didn't know clean-text-utils, so I tried this module and it works fine:

const fs = require('fs-then-native')
const cleanTextUtils = require('clean-text-utils');

async function clean(file) {
  let txt = await fs.readFile(file, 'utf8');
  txt = cleanTextUtils.replace.exoticChars(txt);
  return txt;
}

clean('input.txt')
  .then(result => {
    console.log(result);
  });

javascript

regex

ascii

non-ascii-characters

node.js