Ruby: Check for Byte Order Marker

Question

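In Rails, we read some text files as ISO-8859-1. Sometimes a file arrives as UTF-8 with BOM instead. I am trying to detect whether a file is UTF-8 with BOM and, if so, re-read it as bom|UTF-8.

I tried the following, but the comparison does not seem to work: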

# file is saved as UTF-8 with BOM using Sublime Text 2

> string = File.read(file, encoding: 'ISO-8859-1')

# this doesn't work, although it is supposed to
> string.start_with?("\xef\xbb\xbf".force_encoding("UTF-8"))
> false

# it works if I try this
> string.start_with?('ï»¿')
> true

The goal is to read the file as UTF-8 with BOM whenever it starts with a byte order mark, and I would like to avoid the string.start_with?('ï»¿') approach.

Answer 1

string.start_with?("\u00ef\u00bb\u00bf")

From the official Ruby documentation:

\xnn hexadecimal bit pattern, where nn is 1-2 hexadecimal digits ([0-9a-fA-F])

\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])

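In other words, to insert a Unicode character you should use the \uXXXX notation, whereas \xnn inserts a raw byte. The \u form is safe, and this version can be used reliably.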
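To make the difference between the two escapes concrete, here is a small sketch. It assumes the three BOM bytes are either interpreted as UTF-8 or decoded as ISO-8859-1 and then transcoded to UTF-8 (which can happen when Encoding.default_internal is set to UTF-8; that setting is an assumption about the asker's environment). The variable names are made up for illustration.

```ruby
# The three BOM bytes as a binary string.
raw = "\xEF\xBB\xBF".b

# Interpreted as UTF-8, the bytes form a single character, U+FEFF.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8_length = as_utf8.length                 # one character
is_feff     = (as_utf8 == "\uFEFF")

# Decoded as ISO-8859-1 and transcoded to UTF-8, the same bytes become
# three separate characters: U+00EF, U+00BB, U+00BF.
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
latin1_length  = as_latin1.length            # three characters
matches_escape = (as_latin1 == "\u00ef\u00bb\u00bf")
```

So "\u00ef\u00bb\u00bf" matches the text that results from decoding the BOM as ISO-8859-1, not the BOM character itself.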

Answer 2

That did not work for me; I had to check the bytes instead.

string[0].bytes == [239, 187, 191] # true for UTF-8 + BOM

See BOM for other encodings
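As a rough sketch of how the byte check could extend to other encodings: the byte sequences below are the standard BOMs for UTF-8, UTF-16, and UTF-32, while the `BOMS` constant and the `bom_encoding` helper are hypothetical names that do not come from the answer.

```ruby
# Known BOM byte sequences. Order matters: the UTF-32LE BOM starts with the
# same two bytes as the UTF-16LE BOM, so the longer sequences come first.
BOMS = {
  "UTF-32BE" => [0x00, 0x00, 0xFE, 0xFF],
  "UTF-32LE" => [0xFF, 0xFE, 0x00, 0x00],
  "UTF-8"    => [0xEF, 0xBB, 0xBF],
  "UTF-16BE" => [0xFE, 0xFF],
  "UTF-16LE" => [0xFF, 0xFE]
}.freeze

# Hypothetical helper: takes an array of leading bytes and returns the name
# of the encoding whose BOM they start with, or nil if none matches.
def bom_encoding(bytes)
  match = BOMS.find { |_name, bom| bytes.first(bom.size) == bom }
  match && match.first
end

bom_encoding([0xEF, 0xBB, 0xBF, 0x41]) # UTF-8 BOM followed by "A"
```

Note that a UTF-16LE file whose first character is U+0000 would be indistinguishable from UTF-32LE by this check; that ambiguity is inherent to BOM sniffing.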

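If you only want to check the file and then reopen it correctly (for example with File.open(file, "r:bom|utf-8")), you don't need the whole file; reading the first 3 bytes is enough: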

# open in binary mode; to_s guards against nil, which read(3) returns for an empty file
is_bom = File.open(file, "rb") { |f| f.read(3).to_s.bytes == [239, 187, 191] }
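Putting the check and the re-read together, a minimal sketch might look like the following. `read_text`, `UTF8_BOM`, and the `fallback:` parameter are names invented for illustration, and the ISO-8859-1 default simply mirrors the setup described in the question.

```ruby
UTF8_BOM = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical helper: peek at the first three bytes in binary mode, then
# read the whole file with the encoding that matches what was found.
def read_text(path, fallback: "ISO-8859-1")
  head = File.open(path, "rb") { |f| f.read(3) } # nil for an empty file
  if head && head.bytes == UTF8_BOM
    # "bom|utf-8" makes Ruby skip the BOM, so it is not part of the result.
    File.read(path, encoding: "bom|utf-8")
  else
    File.read(path, encoding: fallback)
  end
end
```

Called as read_text(file), this returns the BOM-free UTF-8 contents when a BOM is present and otherwise reads the file with the fallback encoding.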

ruby

encoding

byte-order-mark

ruby-on-rails