Convert all French accents into HTML character format
For example, I have a bunch of HTML pages like this:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [<a
href="#tailcontent.html">tail</a>] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">
But I can't manage to convert all the French accents in these HTML pages. For the accent above, "Table des matières" shows "è" instead of "&egrave;".
I tried two things:
for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done
=> the accents are not converted
for i in $(ls *.html); do recode ..html $i; done
=> I get the following errors:
recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...
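I don't know how to convert all these French accents.
Does anyone have an idea or suggestion for converting all possible French accents? I would like to use the iconv, recode, or sed commands.
Update 1: Taking a basic example, this is the message I get for a single file: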
$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
What is wrong?
Update 2: Here is the output for my original HTML page:
$ file -i index.html
index.html: text/x-tex; charset=iso-8859-1
and the head of index.html:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
If I apply the command:
$ recode -vfd u8..html index.html
Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
I get:
<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
As you can see, the "è" has disappeared.
What can I do?
Assuming the source file is encoded in UTF-8, the following command works in my environment:
$ recode -vfd u8..html index.html
Output:
$ locale charmap
UTF-8
$ file -i index.html
index.html: text/html; charset=utf-8
$ recode -vfd u8..html index.html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
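You can use the command options to debug errors like this:
-v
Verbose output. Helps to find the step in which the error occurs.
-f
Force completion even when errors occur. You can then compare the output file with the original to find which character or location caused the problem.
-d
For HTML, recode does not convert ASCII characters. This avoids converting HTML characters such as < > " &.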
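The `-f` tip can be turned into a small routine: keep a copy of the file, recode it in place, then diff the two. This is only a minimal sketch; `recode_and_diff` is a hypothetical helper name and the `.orig` suffix is an arbitrary choice.

```shell
# recode_and_diff REQUEST FILE
# Hypothetical helper: back up FILE, recode it in place with -vfd,
# then print the lines that recode changed.
recode_and_diff() {
    request=$1
    file=$2
    cp -p "$file" "$file.orig" || return 1   # keep the untouched original
    recode -vfd "$request" "$file"           # -f: carry on past invalid input
    diff "$file.orig" "$file"                # exit status 1 means "files differ"
}

# Example call (not run here):
#   recode_and_diff u8..html index.html
```

A changed line where an accented letter simply vanished instead of becoming an entity, as in "Table des matires", suggests that the source charset given in the request does not match the file.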
Update: If the encoding/charset is iso-8859-1, then you need to use:
$ recode -vfd iso-8859-1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
# Or use the following:
$ recode -vfd lat1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
ISO-8859-1 has the following aliases in recode:
l1
lat1
latin1
Latin-1
819/CR-LF
CP819/CR-LF
CSISOLATIN1
IBM819/CR-LF
ISO8859-1
iso-ir-100
ISO_8859-1
ISO_8859-1:1987
You can use any of the above in the command.
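For a whole directory where some files may be UTF-8 and others ISO-8859-1, the source charset can be chosen per file. The sketch below assumes those are the only two encodings in play and uses `iconv` purely as a UTF-8 validity check; `guess_charset` and `recode_all` are hypothetical helper names. It takes the files as arguments from a `*.html` glob rather than parsing `ls`, so filenames containing spaces survive.

```shell
# guess_charset FILE: print "u8" if FILE is valid UTF-8, else "lat1".
# Assumption: anything that is not valid UTF-8 is ISO-8859-1.
guess_charset() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo u8
    else
        echo lat1
    fi
}

# recode_all FILE...: convert each file in place to HTML entities,
# leaving ASCII markup characters alone (-d).
recode_all() {
    for f in "$@"; do
        recode -d "$(guess_charset "$f")..html" "$f" ||
            echo "recode failed on $f" >&2
    done
}

# Example call (not run here):
#   recode_all *.html
```

Plain ASCII files also pass the UTF-8 check, so they are handled with the `u8` request. Trying this on copies of the files first seems prudent, since recode rewrites them in place.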