Why include <meta charset="" />?

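I mean, if the browser is already reading the HTML file and is able to read the text <meta charset="" />, then it must already know the file's encoding. So why does the encoding need to be specified inside the HTML file? Isn't that redundant?
Is it because the browser starts reading the file with a minimal character set, such as ASCII, which is a subset of many character sets?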

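It is an old tag, but here is the reason: ISO 646 (dating from 1967) defines a standard set of characters. ASCII pins down the few optional characters in ISO 646, so ISO 646 is the mother of most encodings.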

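Note: most systems were based on this standard, possibly with the ISO 2022 extension, which lets you encode 7-bit and 8-bit characters using several different encodings (for example for Asian character sets, which need more than 256 characters). In any case, the beginning of the text is compatible with ISO 646; control sequences can then change the meaning of the bytes that follow.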

So the browser can read the mostly ASCII data (really ISO 646 or ISO 2022) and from it determine exactly how to interpret all the other characters.

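In Western languages you get mainly ASCII in the low codes (up to 127), but how to interpret the high codes depends on the language (Nordic characters, Western accented characters, Greek characters, and so on). There are also many such encodings, and without an explicit declaration they cannot be reliably detected.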
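To make the point about high codes concrete, here is a small Python sketch. The helper name `interpretations` and the choice of three legacy encodings are mine, picked only for illustration. It decodes the same four bytes under each encoding: the ASCII part is stable, while the byte above 127 changes meaning.

```python
def interpretations(raw: bytes, encodings=("latin-1", "iso8859-7", "cp1251")) -> dict:
    """Decode the same bytes under several 8-bit encodings (illustrative helper)."""
    return {name: raw.decode(name) for name in encodings}

# "caf" is plain ASCII; 0xE9 is a "high" byte whose meaning depends on the encoding.
raw = b"caf\xe9"

for name, text in interpretations(raw).items():
    print(f"{name:10} -> {text}")

# latin-1    -> café   (0xE9 is e with acute accent)
# iso8859-7  -> cafι   (0xE9 is Greek small iota)
# cp1251     -> cafй   (0xE9 is Cyrillic short i)
```

Nothing in the bytes themselves says which of these readings is the intended one, which is why an explicit declaration is needed.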

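Note: this approach fails for a few encodings, namely multi-byte encodings such as UCS-2, UTF-16, and UTF-32, where even ASCII characters take more than one byte. The W3C has a way to detect them, though: the start of the document should consist mostly of ASCII characters, so there should be a lot of 00 bytes. EBCDIC and other encodings not based on ISO 646 (or ASCII) are rare nowadays. In principle you could check for certain byte sequences, but I don't know whether browsers do.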
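The "lots of 00 bytes" idea can be sketched as follows. This is a rough illustration only, assuming the document either starts with a byte order mark or with the character `<`; the function name `sniff_wide_encoding` is hypothetical, and I am not claiming any browser works exactly this way.

```python
def sniff_wide_encoding(head: bytes):
    """Guess a UTF-16/UTF-32 family encoding from the first bytes, or return None."""
    # Byte order marks. The 4-byte UTF-32 marks are checked first because the
    # UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
    boms = [
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if head.startswith(bom):
            return name

    # No BOM: assume the document starts with "<" and look at where the
    # zero bytes fall around it.
    patterns = [
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00", "utf-16-le"),
        (b"\x00<", "utf-16-be"),
    ]
    for pattern, name in patterns:
        if head.startswith(pattern):
            return name
    return None  # probably an ASCII-compatible encoding


print(sniff_wide_encoding("<html>".encode("utf-16-le")))
print(sniff_wide_encoding(b"<html>"))
```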

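In short: with heuristics (and ISO 646) you can guess how to read the ASCII characters, but to know how to interpret the "special characters", such as accented letters, we need more information, supplied by META or by the HTTP header. Note: this also holds for many Asian encodings (those based on ISO 2022).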

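Why META? It is about control. HTTP headers usually require the webmaster to step in, whereas with META the page author can set the encoding themselves (for example when writing static pages; nowadays most dynamic page generators can set HTTP headers).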

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

See also W3.org:

Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it's best to put it immediately after the opening head tag.
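As a rough illustration of why the declaration should sit within the first 1024 bytes, here is a simplified pre-scan in Python. The function name `prescan_meta_charset` and the regular expression are my own simplification; a real browser follows a more careful algorithm than a single regex.

```python
import re

# Simplified pattern: matches <meta charset="..."> as well as the http-equiv
# form, where "charset=" appears inside the content attribute.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def prescan_meta_charset(data: bytes, limit: int = 1024):
    """Return the declared charset found in the first `limit` bytes, or None."""
    head = data[:limit]  # only the start of the file is inspected
    match = META_CHARSET.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(prescan_meta_charset(page))                 # utf-8
print(prescan_meta_charset(b" " * 2000 + page))   # None: declaration is past the limit
```

The search works on raw bytes on purpose: at this point the encoding is still unknown, and only the ASCII-range bytes of the tag are relied on.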

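Yes. The whole premise is that, until your browser's HTML parser has read that meta tag, there should be no bytes that could be ambiguously interpreted as something else; the entire text shown, including the charset attribute value ("utf-8"), fits in ASCII.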
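A quick way to check that premise is to encode the tag under several ASCII-compatible encodings and compare the bytes. The list of encodings below is just a sample I picked; UTF-16 is included as a counterexample.

```python
tag = '<meta charset="utf-8">'

# A sample of ASCII-compatible encodings (chosen for illustration).
ascii_compatible = ("ascii", "utf-8", "latin-1", "cp1252", "shift_jis", "euc-kr", "gb2312")

encoded = {name: tag.encode(name) for name in ascii_compatible}
all_same = len(set(encoded.values())) == 1
print(all_same)  # True: every one of these encodings yields identical bytes

# UTF-16 is not ASCII-compatible: each character takes two bytes.
utf16 = tag.encode("utf-16-le")
print(utf16 == encoded["ascii"], len(utf16), len(encoded["ascii"]))
```

So a parser that assumes any of the ASCII-compatible encodings reads the tag correctly, whichever one the file really uses.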

From Joel's article:

Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.
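The idea can be illustrated with a toy scorer. Everything here is an assumption made for the sketch: the function name `guess_encoding`, the three candidate encodings, and the hand-picked "frequent letter" sets, which stand in for the measured histograms a real detector would use. It only looks at bytes from 128 to 255 and picks the encoding under which the most bytes decode to frequent letters.

```python
# Hypothetical, hand-picked "frequent letter" sets; a real detector would use
# measured letter frequencies for each language.
COMMON_LETTERS = {
    "cp1251": set("оеаинтсрвл"),     # Russian
    "iso8859-7": set("αοιετσνηυρ"),  # Greek
    "latin-1": set("éèàçùâêîôû"),    # French accented letters
}

def guess_encoding(raw: bytes) -> str:
    """Pick the candidate whose decoding of the high bytes looks most plausible."""
    # Bytes below 128 mean the same thing in all candidates, so ignore them.
    high = bytes(b for b in raw if b >= 0x80)
    scores = {}
    for encoding, letters in COMMON_LETTERS.items():
        text = high.decode(encoding, errors="replace")
        scores[encoding] = sum(ch in letters for ch in text)
    return max(scores, key=scores.get)


print(guess_encoding("привет мир".encode("cp1251")))         # cp1251
print(guess_encoding("καλημέρα κόσμε".encode("iso8859-7")))  # iso8859-7
```

On short or unusual text a scorer like this guesses wrong easily, which is the reason an explicit declaration is preferred.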

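A typical HTML parser works like this:

  1. Is there a Content-Type response header with a charset parameter? Use it to decode the bytes of the received content into a string.
  2. Otherwise, start reading the HTML as ASCII (or UTF-8). Is there a <meta http-equiv="Content-Type"> element with a usable charset? Use that.
  3. Otherwise, start parsing the bytes and use heuristics to determine the most likely encoding.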
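The three steps can be sketched in Python roughly as follows. This is a simplified illustration, not how any particular browser is implemented: `choose_encoding` is a hypothetical name, the meta lookup is a plain regex over the first 1024 bytes, and step 3 is reduced to a stand-in heuristic ("does it decode as strict UTF-8? otherwise assume a fallback").

```python
import re

HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def choose_encoding(content_type_header, body: bytes, fallback: str = "windows-1252"):
    """Return (encoding, source) following the header -> meta -> heuristic order."""
    # 1. The Content-Type response header, if it carries a charset parameter.
    if content_type_header:
        match = HEADER_CHARSET.search(content_type_header)
        if match:
            return match.group(1).lower(), "http-header"

    # 2. A meta declaration near the start of the body, read as ASCII bytes.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 3. Heuristics. Stand-in only: strict UTF-8 decoding, else the fallback.
    try:
        body.decode("utf-8")
        return "utf-8", "heuristic"
    except UnicodeDecodeError:
        return fallback, "heuristic"


print(choose_encoding("text/html; charset=ISO-8859-1", b"<meta charset='utf-8'>"))
print(choose_encoding("text/html", b'<meta charset="utf-8">'))
print(choose_encoding(None, b"<p>caf\xe9</p>"))
```

The first call shows the header winning over a conflicting meta declaration, which matches the order of the list above.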