How to use NodeJS to read a large utf-8 encoded text file

Question

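There are many answers about how to use NodeJS to read a utf-8 encoded text file; my question, however, is how to read a large one. Here, "LARGE" means larger than available memory, for example 64GB.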

Suppose we have a 64GB JSON file that contains utf-8 characters. How do we get the value for a path key, as in locale.ja.test=測定? For example, given the JSON object { "a": { "b": { "c": 1 } } }, the path key a.b.c refers to the value 1.
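To pin down what "path key" means here, this is a minimal sketch of the lookup on a small object that is already in memory. The helper name getByPathKey is made up for illustration, and this does not address the 64GB case, which needs a streaming approach.

```javascript
// Hypothetical helper (not from the question): resolve a dotted path key
// against an object that already fits in memory.
function getByPathKey(obj, pathKey) {
  let node = obj;
  for (const key of pathKey.split('.')) {
    const isObject = node !== null && typeof node === 'object';
    if (!isObject || !Object.prototype.hasOwnProperty.call(node, key)) {
      return undefined; // path does not exist
    }
    node = node[key];
  }
  return node;
}

const doc = { a: { b: { c: 1 } }, locale: { ja: { test: '測定' } } };
console.log(getByPathKey(doc, 'a.b.c'));          // 1
console.log(getByPathKey(doc, 'locale.ja.test')); // 測定
```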

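If the text were ASCII-encoded, we could simply split the file into parts; for example, read it in 100MB blocks and use a parser of the form parse(previousStat, block) -> stat. For utf-8 encoded text, however, splitting the file into parts can, in some edge cases, cut a single character across 2 blocks, like ...\x88\x12... -> [...\x88], [\x12...].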
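The split problem is easy to reproduce on a tiny buffer. The sketch below also shows one possible workaround when splitting by hand: move the cut back until it no longer points at a UTF-8 continuation byte (bit pattern 10xxxxxx). The helper name alignToCharBoundary is hypothetical, and it assumes the input is valid UTF-8.

```javascript
// Hypothetical helper: move a proposed split offset back to the nearest
// UTF-8 character boundary. Continuation bytes look like 10xxxxxx.
function alignToCharBoundary(buf, pos) {
  while (pos > 0 && pos < buf.length && (buf[pos] & 0xC0) === 0x80) {
    pos--;
  }
  return pos;
}

const bytes = Buffer.from('測定', 'utf8'); // 2 characters, 3 bytes each

// Naive split inside the first character: both halves decode incorrectly.
const naive =
  bytes.subarray(0, 2).toString('utf8') + bytes.subarray(2).toString('utf8');
console.log(naive === '測定'); // false

// Offset 4 is inside the second character, so the cut moves back to 3.
const cut = alignToCharBoundary(bytes, 4);
const aligned =
  bytes.subarray(0, cut).toString('utf8') + bytes.subarray(cut).toString('utf8');
console.log(cut, aligned === '測定'); // 3 true
```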

How can a large utf-8 encoded text file be read correctly? Thanks!

Note: the JSON file may be written on a single line, which means readline may not help.

A similar question without an answer:

Reading proper unicode characters into a ReadStream in node.js

Answer 1

After some trial and error, I found a solution:

For a large file, we need to use a stream, e.g. fs.createReadStream('...')

For unicode, we need to set the encoding option: fs.createReadStream('/path/to/file', { encoding: 'utf8', fd: null })

To count byte positions, we need to convert each decoded string back into a Buffer and take its length, e.g. Buffer.from(stream.read()).length
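The points above can be tried without a file. This sketch uses StringDecoder from Node's string_decoder module, which holds back an incomplete trailing character until the next chunk arrives; as far as I know, a readable stream with an encoding set handles multi-byte characters the same way. The helper name decodeWithOffsets is hypothetical, and Buffer.byteLength is used as a cheaper equivalent of Buffer.from(text).length.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: decode raw byte chunks and record the byte offset
// at which each decoded piece starts in the overall input.
function decodeWithOffsets(chunks) {
  const decoder = new StringDecoder('utf8');
  const pieces = [];
  let offset = 0;
  for (const chunk of chunks) {
    // write() returns only complete characters and buffers the rest.
    const text = decoder.write(chunk);
    if (text.length > 0) {
      pieces.push({ text, start: offset });
      offset += Buffer.byteLength(text, 'utf8');
    }
  }
  const rest = decoder.end(); // flush anything still buffered
  if (rest.length > 0) {
    pieces.push({ text: rest, start: offset });
  }
  return pieces;
}

const raw = Buffer.from('a測b', 'utf8'); // 5 bytes: 1 + 3 + 1
// The split falls after the first byte of '測'.
const pieces = decodeWithOffsets([raw.subarray(0, 2), raw.subarray(2)]);
console.log(pieces); // 'a' starting at byte 0, then '測b' starting at byte 1
```

Note that '測b'.length is 2 while its byte length is 4, which is why string lengths cannot be used as file positions.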

A complete example that indexes a text file by 3-GRAM and 2-GRAM:

"This is test" -> ["Thi", "his", "is ", "s i", " is", "is ", "s t", " te", "tes", "est"] and ["Th", "hi", "is", "s ", " i", "is", "s ", " t", "te", "es", "st"]
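For reference, the n-gram lists above can be reproduced on an in-memory string with a few lines. This is only a sketch to make the definition concrete; the function name ngrams is made up here, and it splits by code point via Array.from so that characters outside the BMP stay whole.

```javascript
// Hypothetical helper: return all n-grams (by code point) of a string.
function ngrams(text, n) {
  const chars = Array.from(text); // code points, not UTF-16 units
  const out = [];
  for (let i = 0; i + n <= chars.length; i++) {
    out.push(chars.slice(i, i + n).join(''));
  }
  return out;
}

console.log(ngrams('This is test', 3).length); // 10
console.log(ngrams('This is test', 2).length); // 11
console.log(ngrams('測定', 2));                // [ '測定' ]
```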

const i_s = require('stream');
const i_fs = require('fs');

function buildIndexer(I) {
   // I = { indexStat: { cur: 0, gram: [] }, index: { gram2: {}, gram3: {} } }
   // cur: byte offset just past the last character consumed
   // gram: sliding window of the most recent characters (at most 3)
   return new i_s.Transform({
      transform: (chunk, _encoding, next) => {
         // chunk is a string here: the source stream already decoded it as
         // 'utf8' and decodeStrings is false.
         // for...of walks code points, so surrogate pairs are not split.
         for (const ch of chunk) {
            I.indexStat.cur += Buffer.byteLength(ch, 'utf8');
            I.indexStat.gram.push(ch);
            if (I.indexStat.gram.length > 3) {
               I.indexStat.gram.shift();
            }
            const gn = I.indexStat.gram.length;
            if (gn >= 2) {
               const g2 = `${I.indexStat.gram[gn-2]}${I.indexStat.gram[gn-1]}`;
               I.index.gram2[g2] = I.index.gram2[g2] || [];
               // store the byte offset at which this 2-gram starts
               I.index.gram2[g2].push(I.indexStat.cur - Buffer.byteLength(g2, 'utf8'));
            }
            if (gn >= 3) {
               const g3 = `${I.indexStat.gram[gn-3]}${I.indexStat.gram[gn-2]}${I.indexStat.gram[gn-1]}`;
               I.index.gram3[g3] = I.index.gram3[g3] || [];
               // store the byte offset at which this 3-gram starts
               I.index.gram3[g3].push(I.indexStat.cur - Buffer.byteLength(g3, 'utf8'));
            }
         }
         next(null, chunk);
      },
      decodeStrings: false,
      encoding: 'utf8',
   });
}

const S = i_fs.createReadStream('/path/to/file', { encoding: 'utf8', fd: null });
const I = { indexStat: { cur: 0, gram: [] }, index: { gram2: {}, gram3: {} } };
const T = buildIndexer(I);
S.pipe(T);
// Nothing consumes T's output, so drain it; otherwise backpressure stalls the pipe.
T.resume();
// The read stream closes its own file descriptor (autoClose defaults to true).
T.on('finish', () => console.log(I.index));

text-processing

utf-8

node.js