如何防止 FormatException: Unfinished UTF-8 octet sequence
How can I prevent FormatException: Unfinished UTF-8 octet sequence
我已经下载了维基百科转储,我正在尝试逐行阅读它。但是在进行 utf8 解码时出现以下错误
12633: FormatException: Unfinished UTF-8 octet sequence (at offset 65536)
Stacktrace :#0 _Utf8Decoder.convertSingle (dart:convert-patch/convert_patch.dart:1789:7)
#1 Utf8Decoder.convert (dart:convert/utf.dart:351:42)
#2 Utf8Codec.decode (dart:convert/utf.dart:63:20)
#3 _MapStream._handleData (dart:async/stream_pipe.dart:213:31)
#4 _ForwardingStreamSubscription._handleData (dart:async/stream_pipe.dart:153:13)
#5 _RootZone.runUnaryGuarded (dart:async/zone.dart:1618:10)
#6 _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:341:11)
#7 _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271:7)
#8 _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:774:19)
#9 _StreamController._add (dart:async/stream_controller.dart:648:7)
#10 _StreamController.add (dart:async/stream_controller.dart:596:5)
#11 _FileStream._readBlock.<anonymous closure> (dart:io/file_impl.dart:98:19)
<asynchronous suspension>
也就是这条线
ar ✿๑_غالاباغوس 1 0
所以我尝试保存使用此按钮编码的 utf-8 文件
但这似乎不起作用
这是我的代码
final filePath = p.join(
Directory.current.path,
'bin\migrate_most_views\data\pageviews-20220416-170000',
);
final file = File(filePath);
logger.stderr('exporting pageviews...');
StreamSubscription? reader;
int lineNumer = 0;
reader = file.openRead().map(utf8.decode).transform(LineSplitter()).listen(
(line) {
final page = MostViewedPageDaily.fromLine(line);
db.collection('page_views').insert(page.toMap());
lineNumer++;
if (lineNumer % 1000 == 0) {
logger.stdout('inserting at line $lineNumer');
}
},
onDone: () {
logger.stdout('Reader read $lineNumer lines');
reader?.cancel();
exit(0);
},
onError: (error, stackTrace) {
final message = '$lineNumer: $error\n\nStacktrace :$stackTrace';
logger.stdout(logger.ansi.error(message));
exit(1);
},
cancelOnError: true,
);
我能做什么?
我从这里下载文件
https://dumps.wikimedia.org/other/pageviews/2022/2022-04/pageviews-20220417-010000.gz
您应该使用 file.openRead().transform(utf8.decoder)
而不是 file.openRead().map(utf8.decode)
。 (另请注意参数差异:utf8.decoder
is a Utf8Decoder
object, and utf8.decode
是一种方法 tear-off。)
Stream.map
documentation专门讨论这个:
Unlike transform
, this method does not treat the stream as chunks of a single value. Instead each event is converted independently of the previous and following events, which may not always be correct. For example, UTF-8 encoding, or decoding, will give wrong results if a surrogate pair, or a multibyte UTF-8 encoding, is split into separate events, and those events are attempted encoded or decoded independently.
我已经下载了维基百科转储,我正在尝试逐行阅读它。但是在进行 utf8 解码时出现以下错误
12633: FormatException: Unfinished UTF-8 octet sequence (at offset 65536)
Stacktrace :#0 _Utf8Decoder.convertSingle (dart:convert-patch/convert_patch.dart:1789:7)
#1 Utf8Decoder.convert (dart:convert/utf.dart:351:42)
#2 Utf8Codec.decode (dart:convert/utf.dart:63:20)
#3 _MapStream._handleData (dart:async/stream_pipe.dart:213:31)
#4 _ForwardingStreamSubscription._handleData (dart:async/stream_pipe.dart:153:13)
#5 _RootZone.runUnaryGuarded (dart:async/zone.dart:1618:10)
#6 _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:341:11)
#7 _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271:7)
#8 _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:774:19)
#9 _StreamController._add (dart:async/stream_controller.dart:648:7)
#10 _StreamController.add (dart:async/stream_controller.dart:596:5)
#11 _FileStream._readBlock.<anonymous closure> (dart:io/file_impl.dart:98:19)
<asynchronous suspension>
也就是这条线
ar ✿๑_غالاباغوس 1 0
所以我尝试保存使用此按钮编码的 utf-8 文件
但这似乎不起作用
这是我的代码
final filePath = p.join(
Directory.current.path,
'bin\migrate_most_views\data\pageviews-20220416-170000',
);
final file = File(filePath);
logger.stderr('exporting pageviews...');
StreamSubscription? reader;
int lineNumer = 0;
reader = file.openRead().map(utf8.decode).transform(LineSplitter()).listen(
(line) {
final page = MostViewedPageDaily.fromLine(line);
db.collection('page_views').insert(page.toMap());
lineNumer++;
if (lineNumer % 1000 == 0) {
logger.stdout('inserting at line $lineNumer');
}
},
onDone: () {
logger.stdout('Reader read $lineNumer lines');
reader?.cancel();
exit(0);
},
onError: (error, stackTrace) {
final message = '$lineNumer: $error\n\nStacktrace :$stackTrace';
logger.stdout(logger.ansi.error(message));
exit(1);
},
cancelOnError: true,
);
我能做什么?
我从这里下载文件
https://dumps.wikimedia.org/other/pageviews/2022/2022-04/pageviews-20220417-010000.gz
您应该使用 file.openRead().transform(utf8.decoder)
而不是 file.openRead().map(utf8.decode)
。 (另请注意参数差异:utf8.decoder
is a Utf8Decoder
object, and utf8.decode
是一种方法 tear-off。)
Stream.map
documentation专门讨论这个:
Unlike
transform
, this method does not treat the stream as chunks of a single value. Instead each event is converted independently of the previous and following events, which may not always be correct. For example, UTF-8 encoding, or decoding, will give wrong results if a surrogate pair, or a multibyte UTF-8 encoding, is split into separate events, and those events are attempted encoded or decoded independently.