Decode Cyrillic HTML entities in C#

I use HtmlAgilityPack to fetch some strings from a website, and they contain HTML entities for Cyrillic letters.

Example:

"&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"

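Is there a way to decode these entities into the actual characters in C# when saving to a file? I tried HttpUtility.HtmlDecode (System.Web) and WebUtility.HtmlDecode, but neither helped.

My attempt: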

using System;
using System.Web;

namespace esp
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string body = "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;";

            // Prints the input unchanged: "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"
            Console.WriteLine(HttpUtility.HtmlDecode(body));
        }
    }
}

Just a guess. As far as I can see, each entity has the following format:

  &
   Letter(s) - transliterated letter 
   cy        - stands for Cyrillic 
  ; 

We can match all the letters with the help of a regular expression and then concatenate them, for example:

  using System.Text.RegularExpressions;

  ...

  string body = "&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;";

  // Keep only the "letter" part of every &<letter>cy; entity
  var transliteratedText = Regex.Replace(
         body, 
         @"&(?<letter>[A-Za-z]+)cy;",
         m => m.Groups["letter"].Value);

  Console.Write(transliteratedText);

The output is

Korpus

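This looks reasonable, since it is the transliterated Russian word Корпус (Corpus, Body, Bulk, Carcass). There are several transliteration standards (I tried the Library of Congress scheme, which is just one of the most popular ones); to detect the right standard (or to create our own), we need more data.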
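
One way to gather that data is to list every distinct entity name that actually occurs in the downloaded text. The following is only a sketch: the EntitySurvey class and the DistinctLetters method are names made up for this example, and it assumes every entity of interest has the &<letter>cy; shape described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class EntitySurvey
{
    // Returns the distinct "letter" parts of all &<letter>cy; entities,
    // sorted ordinally so that upper- and lower-case names stay separate.
    public static List<string> DistinctLetters(string text)
    {
        return Regex.Matches(text, @"&(?<letter>[A-Za-z]+)cy;")
            .Cast<Match>()
            .Select(m => m.Groups["letter"].Value)
            .Distinct()
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToList();
    }
}

static class Program
{
    static void Main()
    {
        var letters = EntitySurvey.DistinctLetters("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;");
        Console.WriteLine(string.Join(", ", letters)); // K, o, p, r, s, u
    }
}
```

Running this over a larger sample of pages shows which names still need an entry in the transliteration scheme.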

Edit: For example, if we have a scheme, say,

private static Dictionary<string, string> translit = 
  new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
  {"a", "а"},
  {"b", "б"},
  {"v", "в"},
  {"g", "г"},
  {"d", "д"},
  {"ie", "е"},
  //{"", "ё"}, //TODO: define the letter transliteration
  {"zh", "ж"},
  {"z", "з"},
  {"i", "и"},
  {"j", "й"},
  {"k", "к"},
  {"l", "л"},
  {"m", "м"},
  {"n", "н"},
  {"o", "о"},
  {"p", "п"},
  {"r", "р"},
  {"s", "с"},
  {"t", "т"},
  {"u", "у"},
  {"f", "ф"},
  {"kh", "х"}, // the HTML entity for х is &khcy;
  {"ts", "ц"},
  {"ch", "ч"},
  {"sh", "ш"},
  {"shch", "щ"},
  //{"", "ъ"}, //TODO: define the letter transliteration
  {"y", "ы"},
  //{"", "ь"}, //TODO: define the letter transliteration
  //{"", "э"}, //TODO: define the letter transliteration
  //{"", "ю"}, //TODO: define the letter transliteration
  {"ya", "я"},
};

we can transliterate each letter:

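// Requires: using System.Globalization; using System.Text.RegularExpressions;
private static string MyDecoding(string value) {
  // Replace every &<letter>cy; entity with the matching Cyrillic letter
  return Regex
    .Replace(value, @"&(?<letter>[A-Za-z]+)cy;", m => {
      string v = m.Groups["letter"].Value;

      // Leave entities that are missing from the scheme untouched
      if (!translit.TryGetValue(v, out string letter))
        return m.Value;

      // An upper-case entity name stands for a capital letter
      return char.IsUpper(v[0])
        ? CultureInfo.InvariantCulture.TextInfo.ToTitleCase(letter)
        : letter;
    });
}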
...

Console.Write(MyDecoding("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"));

Outcome:

Корпус
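
The letters still marked TODO can be filled in if we assume the site emits the standard HTML named character references for Cyrillic, where ё is &iocy;, ъ is &hardcy;, ь is &softcy;, э is &ecy;, ю is &yucy; and х is &khcy;. Below is a self-contained sketch under that assumption; CyrillicEntities and Decode are names invented for this example, and it covers only the 33 letters of the Russian alphabet. Entities that are not in the table are left as they are.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CyrillicEntities
{
    // Entity name without the "cy" suffix -> lower-case Russian letter.
    // Assumption: names follow the standard HTML named character references.
    private static readonly Dictionary<string, string> Letters =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            {"a", "а"}, {"b", "б"}, {"v", "в"}, {"g", "г"}, {"d", "д"},
            {"ie", "е"}, {"io", "ё"}, {"zh", "ж"}, {"z", "з"}, {"i", "и"},
            {"j", "й"}, {"k", "к"}, {"l", "л"}, {"m", "м"}, {"n", "н"},
            {"o", "о"}, {"p", "п"}, {"r", "р"}, {"s", "с"}, {"t", "т"},
            {"u", "у"}, {"f", "ф"}, {"kh", "х"}, {"ts", "ц"}, {"ch", "ч"},
            {"sh", "ш"}, {"shch", "щ"}, {"hard", "ъ"}, {"y", "ы"},
            {"soft", "ь"}, {"e", "э"}, {"yu", "ю"}, {"ya", "я"},
        };

    public static string Decode(string value)
    {
        return Regex.Replace(value, @"&(?<letter>[A-Za-z]+)cy;", m =>
        {
            string name = m.Groups["letter"].Value;
            string letter;
            if (!Letters.TryGetValue(name, out letter))
                return m.Value; // unknown entity: leave it untouched

            // An upper-case first character in the name marks a capital letter.
            return char.IsUpper(name[0]) ? letter.ToUpperInvariant() : letter;
        });
    }
}

static class Program
{
    static void Main()
    {
        Console.OutputEncoding = System.Text.Encoding.UTF8;
        Console.WriteLine(CyrillicEntities.Decode("&Kcy;&ocy;&rcy;&pcy;&ucy;&scy;"));
    }
}
```

If the pages also contain non-Russian Cyrillic letters (Ukrainian, Serbian and so on), the table would need further entries.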