Character encoding issue when using Google Apps Script to extract data from web page

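I've written a script with Google Apps Script to extract text from a web page into Google Sheets. I only need the script to work on one specific web page, so it doesn't have to be general-purpose. It works almost exactly as I want, except that I've run into a character encoding problem. I'm extracting Hebrew and English text. The meta tag in the HTML has charset=Windows-1255. The English extracts perfectly, but the Hebrew shows up as black diamonds containing question marks.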

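I found this question, which says to pass the data into a blob and then use the getDataAsString method to convert to another encoding. I tried converting to different encodings and got different results. UTF-8 shows black diamonds with question marks, UTF-16 shows Korean, ISO 8859-8 returns an error saying it is not a valid argument, and the original Windows-1255 shows one Hebrew character but a bunch of other gibberish.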

However, I can manually copy and paste the Hebrew text into Google Sheets, and it displays correctly.

I even tested passing Hebrew directly from the Google Apps Script code, like this:

function passHebrew() {
  return "וַיְדַבֵּר";
}

This displays the Hebrew text correctly in Google Sheets.

My code is as follows:

function parseText(book, chapter) {
  //var bk = book;
  //var ch = chapter;
  var bk = '04'; //hard-coded for testing purposes
  var ch = '01'; //hard-coded for testing purposes
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText();

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">');
  xml = xml.replace('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">', '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css"></LINK>');
  xml = xml.replace('<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255">', '<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255"></meta>');
  xml = xml.replace(/ALIGN=CENTER/gi, 'ALIGN="CENTER"');
  xml = xml.replace(/<BR>/gi, '<BR></BR>');
  xml = xml.replace(/class=h/gi, 'class="h"');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This creates a two-dimensional array so that I can store the Hebrew
  //in the first column and the English in the second column
  var array = new Array(maintablechildren.length);
  for (var i = 0; i < maintablechildren.length; i++) {
    array[i] = new Array(2);
  }

  //This is where the table gets parsed into the array
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //This is where the encoding problem occurs.
    //I originally tried verse[0].getText() but it didn't work.
    array[i][0] = Utilities.newBlob(verse[0].getText()).getDataAsString('UTF-8');
    //This array receives the English text and works fine.
    array[i][1] = verse[1].getText();
  }

  return array;
}

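What am I overlooking, misunderstanding, or doing wrong? I don't understand much about how encoding works, so I don't see why it can't be converted to UTF-8.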

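Your problem occurs before the line you've commented as being the encoding problem: UrlFetchApp's default decoding has already mangled the Hebrew text from the start. Once the bytes have been decoded with the wrong charset, no later conversion of the resulting string can bring them back.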
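To see why nothing after the fetch can repair the text, here is a minimal sketch that runs in Node.js or a browser rather than in Apps Script. It assumes the default decoding behaves like UTF-8. The sample bytes and the decodeHebrewLetters helper are made up for illustration, and the helper covers only ASCII and the 27 Hebrew letters (0xE0 to 0xFA in Windows-1255), not vowel points.

```javascript
// Bytes of the Hebrew word "שלום" as a Windows-1255 page would send them.
var bytes = new Uint8Array([0xF9, 0xEC, 0xE5, 0xED]);

// Decode with the wrong charset (UTF-8 is an assumption about the default).
var asUtf8 = new TextDecoder('utf-8').decode(bytes);

// Hypothetical helper: a tiny Windows-1255 decoder for ASCII and the
// Hebrew letters only. Any other byte is reported as U+FFFD.
function decodeHebrewLetters(byteArray) {
  var out = '';
  for (var i = 0; i < byteArray.length; i++) {
    var b = byteArray[i];
    if (b >= 0xE0 && b <= 0xFA) {
      // 0xE0..0xFA map onto U+05D0 (alef) .. U+05EA (tav)
      out += String.fromCharCode(0x05D0 + (b - 0xE0));
    } else if (b < 0x80) {
      out += String.fromCharCode(b);
    } else {
      out += '\uFFFD';
    }
  }
  return out;
}

var asHebrew = decodeHebrewLetters(bytes);
```

In this sample every byte decoded as UTF-8 comes out as U+FFFD, the black diamond with a question mark, so the original byte values are gone before the string ever reaches Utilities.newBlob(). Decoding the same bytes with the right table gives the Hebrew word back.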

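You should use the variant of the .getContentText() method that takes a charset argument, which the documentation describes as: "Returns the content of an HTTP response encoded as a string of the given charset." In your case: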

var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

That should be all you need to change. The Utilities.newBlob() workaround is no longer needed (although it is harmless). Other comments:

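  • The logical OR operator (||) is handy for setting default values. I've tweaked the first couple of lines to enable testing while still letting the function work normally when it is given parameters.

  • Setting up an empty array before filling it with strings is poor JavaScript; it's complicated code that isn't needed, so toss it. Instead, we declare the array and then push() rows onto it.

  • The .replace() calls can be reduced by cleverer use of RegExp; I've included URLs to demonstrations of the really tricky ones.

  • There are \n newlines in the text, which I'm guessing you don't need for your purposes, so I added a replace() for them as well.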
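As a rough check of the tag clean-up, the chain of replacements can be run on a small hand-made sample outside Apps Script. The fixMarkup name and the sample markup are invented for this sketch; the replacement strings use the standard $1, $2, $3 capture-group references of String.prototype.replace.

```javascript
// Hypothetical helper: applies the clean-up chain to a markup string.
function fixMarkup(markup) {
  return markup
    .replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')         // add an empty system identifier
    .replace(/(<(LINK|meta).*>)/gi, '$1</$2>')        // close LINK and meta elements
    .replace(/(<.*?=)([^"']*?)([ >])/gi, '$1"$2"$3')  // quote bare attribute values
    .replace(/<BR>/gi, '<BR/>')                       // self-close line breaks
    .replace(/\n/g, '');                              // drop newlines
}

// Made-up sample, one tag per line.
var sample = [
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">',
  '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">',
  '<P ALIGN=CENTER>one<BR>two</P>'
].join('\n');

var fixed = fixMarkup(sample);
```

These patterns rely on each LINK or meta tag sitting on its own line: the greedy .* in that rule would swallow everything up to the last > on a line that held more than one tag, so treat them as fitted to this one page rather than as a general HTML-to-XML converter.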

Here's what you're left with:

function parseText(book, chapter) {
  var bk = book || '04'; //default used for testing when no book is passed
  var ch = chapter || '01'; //default used for testing when no chapter is passed
  var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';

  var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");

  //I had to "fix" these xml errors for XmlService.parse(xml) below
  //to function.
  xml = xml.replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')
           .replace(/(<(LINK|meta).*>)/gi,'$1</$2>')        // https://regex101.com/r/nH3pU8/1
           .replace(/(<.*?=)([^"']*?)([ >])/gi,'$1"$2"$3')  // https://regex101.com/r/eP7wO7/1
           .replace(/<BR>/gi, '<BR/>')
           .replace(/\n/g, '');

  //This section is the specific route to the table in the page I want
  var document = XmlService.parse(xml);
  var body = document.getRootElement().getChildren("BODY");
  var maintable = body[0].getChildren("TABLE");
  var maintablechildren = maintable[0].getChildren();

  //This is where the table gets parsed into the array
  var array = [];
  for (var i = 0; i < maintablechildren.length; i++) {
    var verse = maintablechildren[i].getChildren();

    //I originally tried verse[0].getText() but it didn't work.** It does now!
    var hebrew = verse[0].getText();
    //This array receives the English text and works fine.
    var english = verse[1].getText();
    array.push([hebrew,english]);
  }

  return array;
}

Result

 [
  [
    "  וַיְדַבֵּר יְהוָה אֶל-מֹשֶׁה בְּמִדְבַּר סִינַי, בְּאֹהֶל מוֹעֵד:  בְּאֶחָד לַחֹדֶשׁ הַשֵּׁנִי בַּשָּׁנָה הַשֵּׁנִית, לְצֵאתָם מֵאֶרֶץ מִצְרַיִם--לֵאמֹר.",
    " And the LORD spoke unto Moses in the wilderness of Sinai, in the tent of meeting, on the first day of the second month, in the second year after they were come out of the land of Egypt, saying:"
  ],
  [
    "  שְׂאוּ, אֶת-רֹאשׁ כָּל-עֲדַת בְּנֵי-יִשְׂרָאֵל, לְמִשְׁפְּחֹתָם, לְבֵית אֲבֹתָם--בְּמִסְפַּר שֵׁמוֹת, כָּל-זָכָר לְגֻלְגְּלֹתָם.",
    " 'Take ye the sum of all the congregation of the children of Israel, by their families, by their fathers' houses, according to the number of names, every male, by their polls;"
  ],
  [
    "  מִבֶּן עֶשְׂרִים שָׁנָה וָמַעְלָה, כָּל-יֹצֵא צָבָא בְּיִשְׂרָאֵל--תִּפְקְדוּ אֹתָם לְצִבְאֹתָם, אַתָּה וְאַהֲרֹן.",
    " from twenty years old and upward, all that are able to go forth to war in Israel: ye shall number them by their hosts, even thou and Aaron."
  ],
  ...