Convert all French accents into HTML character format

For example, I have a bunch of HTML pages like this:

<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
 <script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [<a
href="#tailcontent.html">tail</a>] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
 id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">

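But I have not been able to convert all the French accents in these HTML pages. In the title above, "Table des matières", the character "è" appears instead of "&egrave;".

I tried two things: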

  1. for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done

=> the accents are not converted

  2. for i in $(ls *.html); do recode ..html $i; done

=> I get the following errors:

recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...

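I don't know how to convert all these French accents.

Does anyone have an idea or suggestion for converting all possible French accents? I would like to use the iconv, recode, or sed commands.

Update 1: Taking a basic example, here is the message I get for a single file: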

$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2' 

What's wrong?

Update 2: Here is the output for my original HTML page:

$ file -i index.html
index.html: text/x-tex; charset=iso-8859-1

The head of index.html:

<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
 <script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);

If I apply the command:

$ recode -vfd u8..html index.html

Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done

<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
 <script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>

As you can see, the "è" has disappeared.

What can I do?

Assuming the source file is encoded in UTF-8, the following command works in my environment:

$ recode -vfd u8..html index.html

Output:

$ locale charmap
UTF-8

$ file -i index.html
index.html: text/html; charset=utf-8

$ recode -vfd u8..html index.html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done

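You can use these command options to debug the error:

  • -v Verbose output. Helps you find the step in which the error occurs.
  • -f Force completion even if errors occur. You can compare the output file with the original file to find out which character or location is causing the problem.
  • -d For HTML, recode does not convert ASCII characters. This avoids converting HTML characters such as < > " &.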
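
The comparison suggested for -f can be sketched roughly as follows. This is only an illustration, not part of recode: the helper names is_valid_utf8 and show_recode_changes are made up here, and the second one assumes recode, mktemp, and diff are installed. The first helper relies on iconv exiting with a non-zero status when the input is not valid in the declared encoding.

```shell
#!/bin/sh
# Hypothetical helper (not part of recode): succeeds only if the file
# decodes cleanly as UTF-8. iconv exits non-zero on an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: recode a temporary copy with -f and diff it against
# the original, so lines whose characters were altered or dropped stand out.
# Assumes recode is installed; $1 is the source charset, $2 is the file.
show_recode_changes() {
    tmp=$(mktemp) || return 1
    cp "$2" "$tmp"
    recode -fd "$1..html" "$tmp"
    diff "$2" "$tmp"
    rm -f "$tmp"
}
```

For instance, `is_valid_utf8 index.html || echo "not UTF-8"` would flag a file like the one in the question, and `show_recode_changes u8 index.html` would list the lines where "è" was lost, without touching the original file.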

Update: If the encoding/charset is iso-8859-1, then you need to use:

$ recode -vfd iso-8859-1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done

# Or use the following:

$ recode -vfd lat1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done

ISO-8859-1 has the following aliases in recode:

l1 
lat1
latin1
Latin-1
819/CR-LF 
CP819/CR-LF 
CSISOLATIN1 
IBM819/CR-LF 
ISO8859-1 
iso-ir-100 
ISO_8859-1 
ISO_8859-1:1987

You can use any of the above in the command.
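
For a whole directory of pages, a minimal sketch along these lines may help. It is built on one loud assumption: any file that is not valid UTF-8 is treated as ISO-8859-1, which fits the files in the question but is not true in general. The function names guess_charset and recode_html_dir are hypothetical, and the validity check uses iconv rather than file -i.

```shell
#!/bin/sh
# Hypothetical helper: print the recode charset name to use for a file.
# ASSUMPTION: anything that is not valid UTF-8 is ISO-8859-1 (lat1).
guess_charset() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo u8
    else
        echo lat1
    fi
}

# Hypothetical helper: recode every .html file in directory $1 in place.
# Assumes recode is installed. -d leaves ASCII characters such as < > " &
# untouched, so the existing markup is preserved.
recode_html_dir() {
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
        recode -d "$(guess_charset "$f")..html" "$f"
    done
}
```

Calling `recode_html_dir .` would process the current directory. Quoting "$f" also avoids the word-splitting problems of `for i in $(ls *.html)` when a file name contains spaces.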