juniversalchardet is defective on www.wikipedia.org

Question

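I am trying to use juniversalchardet to auto-detect the encoding of saved web pages. My first test uses www.wikipedia.org, which is UTF-8 encoded according to its HTTP response header (that information is lost once the page is saved to disk).

Here is my Scala code: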

    val content = <...load Wikipedia.html from disk...>
    val charsetD = new UniversalDetector(null)
    charsetD.handleData(content, 0, content.length)
    val charset = charsetD.getDetectedCharset

However, no matter what I load, the detected charset is always 'null'. Is this a defect in the juniversalchardet library, or am I using it wrong?

Answer 1

Problem solved: charsetD.handleData(content, 0, content.length) cannot handle a batch larger than 4096 bytes. After calling it multiple times on chunks of the data, everything works.
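
For anyone landing here, a minimal sketch of the chunked approach might look like the following. The object and method names (DetectCharset, detect) are made up for illustration, and the 4096-byte chunk size is simply the figure mentioned above. The loop mirrors the sample usage shipped with juniversalchardet, which also calls dataEnd() before reading the result; the code in the question never calls it, so that may be worth checking as well.

```scala
import java.io.{FileInputStream, InputStream}
import org.mozilla.universalchardet.UniversalDetector

// Hypothetical helper object, not part of juniversalchardet itself.
object DetectCharset {

  // Feeds the stream to the detector in chunks of at most chunkSize bytes.
  // Returns None when the detector cannot decide (getDetectedCharset is null).
  def detect(in: InputStream, chunkSize: Int = 4096): Option[String] = {
    val detector = new UniversalDetector(null)
    val buf = new Array[Byte](chunkSize)

    var nread = in.read(buf)
    // Stop early once the detector reports that it is confident.
    while (nread > 0 && !detector.isDone()) {
      detector.handleData(buf, 0, nread)
      nread = in.read(buf)
    }

    // Signal end of input so the detector can settle on its best guess.
    detector.dataEnd()
    Option(detector.getDetectedCharset())
  }

  def main(args: Array[String]): Unit = {
    // args(0) is the path to the saved page, e.g. Wikipedia.html
    val in = new FileInputStream(args(0))
    try println(detect(in).getOrElse("unknown"))
    finally in.close()
  }
}
```

Wrapping the result in Option keeps the 'null' case explicit, so the caller has to decide on a fallback encoding instead of passing null along.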

scala

character-encoding

chardet