How can I prevent FormatException: Unfinished UTF-8 octet sequence

I have downloaded a Wikipedia dump and I am trying to read it line by line. However, UTF-8 decoding fails with the following error:

12633: FormatException: Unfinished UTF-8 octet sequence (at offset 65536)

Stacktrace :#0      _Utf8Decoder.convertSingle (dart:convert-patch/convert_patch.dart:1789:7)
#1      Utf8Decoder.convert (dart:convert/utf.dart:351:42)
#2      Utf8Codec.decode (dart:convert/utf.dart:63:20)
#3      _MapStream._handleData (dart:async/stream_pipe.dart:213:31)
#4      _ForwardingStreamSubscription._handleData (dart:async/stream_pipe.dart:153:13)
#5      _RootZone.runUnaryGuarded (dart:async/zone.dart:1618:10)
#6      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:341:11)
#7      _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271:7)
#8      _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:774:19)
#9      _StreamController._add (dart:async/stream_controller.dart:648:7)
#10     _StreamController.add (dart:async/stream_controller.dart:596:5)
#11     _FileStream._readBlock.<anonymous closure> (dart:io/file_impl.dart:98:19)
<asynchronous suspension>

It corresponds to this line:

ar ✿๑_غالاباغوس 1 0

So I tried re-saving the file with UTF-8 encoding using this button

But that doesn't seem to work.

Here is my code:

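  // Pass each path segment separately: in a normal Dart string literal,
  // sequences such as `\m` and `\d` are escapes and the backslash is dropped.
  final filePath = p.join(
    Directory.current.path,
    'bin',
    'migrate_most_views',
    'data',
    'pageviews-20220416-170000',
  );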
  final file = File(filePath);

  logger.stderr('exporting pageviews...');

  StreamSubscription? reader;
  int lineNumer = 0;
  reader = file.openRead().map(utf8.decode).transform(LineSplitter()).listen(
    (line) {
      final page = MostViewedPageDaily.fromLine(line);
      db.collection('page_views').insert(page.toMap());

      lineNumer++;
      if (lineNumer % 1000 == 0) {
        logger.stdout('inserting at line $lineNumer');
      }
    },
    onDone: () {
      logger.stdout('Reader read $lineNumer lines');
      reader?.cancel();
      exit(0);
    },
    onError: (error, stackTrace) {
      final message = '$lineNumer: $error\n\nStacktrace :$stackTrace';
      logger.stdout(logger.ansi.error(message));
      exit(1);
    },
    cancelOnError: true,
  );

What can I do?

I downloaded the file from here:

https://dumps.wikimedia.org/other/pageviews/2022/2022-04/pageviews-20220417-010000.gz

You should use file.openRead().transform(utf8.decoder) instead of file.openRead().map(utf8.decode). (Also note the difference in the arguments: utf8.decoder is a Utf8Decoder object, whereas utf8.decode is a method tear-off.)
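Applied to the reading part of the question's code, the change might look like the minimal sketch below. The helper name decodedLines and the fallback file name in main are assumptions made for illustration; the database insert and logging from the question are left out.

```dart
import 'dart:convert';
import 'dart:io';

/// Hypothetical helper: streams the file at [path] as decoded text lines.
Stream<String> decodedLines(String path) {
  return File(path)
      .openRead()
      // The decoder keeps state between chunks, so a multibyte character
      // cut in half by a read-buffer boundary is completed by the next chunk.
      .transform(utf8.decoder)
      .transform(const LineSplitter());
}

Future<void> main(List<String> args) async {
  // Assumed default file name; pass a real path as the first argument.
  final path = args.isNotEmpty ? args.first : 'pageviews-20220416-170000';
  var lineNumber = 0;
  await for (final line in decodedLines(path)) {
    lineNumber++;
    if (lineNumber % 1000 == 0) {
      stdout.writeln('at line $lineNumber: $line');
    }
  }
  stdout.writeln('Reader read $lineNumber lines');
}
```

Using await for instead of listen is only a convenience here; the fix itself is the switch from map(utf8.decode) to transform(utf8.decoder).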

The Stream.map documentation specifically discusses this:

Unlike transform, this method does not treat the stream as chunks of a single value. Instead each event is converted independently of the previous and following events, which may not always be correct. For example, UTF-8 encoding, or decoding, will give wrong results if a surrogate pair, or a multibyte UTF-8 encoding, is split into separate events, and those events are attempted encoded or decoded independently.
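The behaviour described in that passage can be reproduced without a file. The sketch below splits the three UTF-8 bytes of a single character into two chunks, the way a read-buffer boundary might, and decodes them both ways. The function names mapDecodeFails and transformDecode are hypothetical, chosen only for this illustration.

```dart
import 'dart:async';
import 'dart:convert';

/// Hypothetical helper: decodes every chunk independently with `map`.
/// Returns true if that raises a FormatException.
Future<bool> mapDecodeFails(List<List<int>> chunks) async {
  try {
    await Stream<List<int>>.fromIterable(chunks).map(utf8.decode).toList();
    return false;
  } on FormatException {
    return true;
  }
}

/// Hypothetical helper: decodes the chunks as one continuous byte stream.
Future<String> transformDecode(List<List<int>> chunks) {
  return Stream<List<int>>.fromIterable(chunks).transform(utf8.decoder).join();
}

Future<void> main() async {
  final List<int> bytes = utf8.encode('✿'); // three bytes: E2 9C BF
  // Cut the character in the middle of its byte sequence.
  final chunks = <List<int>>[bytes.sublist(0, 2), bytes.sublist(2)];

  print(await mapDecodeFails(chunks)); // prints true
  print(await transformDecode(chunks) == '✿'); // prints true
}
```

The error in the question was reported at offset 65536, which is consistent with a multibyte character straddling the end of a 64 KiB chunk.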