Markup matches and escape special characters outside the matches at the same time

Question

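I have a search function for a treeview that highlights all matches, including case-sensitive and case-insensitive searches as well as regular-expression and literal searches. However, I run into problems when the current cell contains special characters that are not part of a match. Consider the following text inside a treeview cell: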

father & mother

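Now I want to search for the letter 'e' throughout the treeview. To highlight only the matches rather than the whole cell, I need to use markup. For this I use g_regex_replace_eval with a callback function, in the way described in the GLib documentation. The new markup text generated for the cell looks like this: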

fath<span background='yellow' foreground='black'>e</span>r & moth<span background='yellow' foreground='black'>e</span>r

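If a match contains special characters, they are escaped before being added to the hash table used by the eval function, so special characters inside a match are not a problem.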

But now I have an '&' outside the markup, and it has to be changed to &amp;; otherwise the markup is not shown in the cell and the warning

Failed to set text from markup due to error parsing markup: Error on line x: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &amp;

is printed in the terminal.

If I use g_markup_escape_text on the new cell text, it obviously escapes not only the '&' but also the '<' and '>' of the markup, so that is not a solution.

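Is there a reasonable way to put markup around the matches and escape the special characters outside the markup, either at the same time or in a few steps? Everything I have come up with so far is too complicated, if it would work at all.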

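Although I had already considered most of Philip's suggestions before asking, I had not touched the subject of UTF-8, so he gave an important hint towards the solution. Here is the core of a working implementation: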

gchar *counter_char = original_cell_txt; // counter_char will move through all the characters of original_cell_txt.
gint counter;

gunichar unichar;
gchar utf8_char[6]; // Six bytes is the buffer size needed later by g_unichar_to_utf8 (). 
gint utf8_length;
gchar *utf8_escaped;

enum { START_POS, END_POS };
GArray *positions[2];
positions[START_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
positions[END_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
gint start_position, end_position;

txt_with_markup = g_string_new ("");    

g_regex_match (regex, original_cell_txt, 0, &match_info);

while (g_match_info_matches (match_info)) {
    g_match_info_fetch_pos (match_info, 0, &start_position, &end_position);
    g_array_append_val (positions[START_POS], start_position);
    g_array_append_val (positions[END_POS], end_position);
    g_match_info_next (match_info, NULL);
}

do {
    unichar = g_utf8_get_char (counter_char);
    counter = counter_char - original_cell_txt; // pointer arithmetic

    // The length checks avoid reading index 0 of an array that is empty (no matches, or all consumed).
    if (positions[END_POS]->len && counter == g_array_index (positions[END_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "</span>");
        // It's simpler to always access the first element instead of looping through the whole array.
        g_array_remove_index (positions[END_POS], 0);
    }
    /*
        No "else if" is used here: when searching for a single character that appears twice in a row,
        like 'm' in "command", a span tag has to be closed and opened at the same position
        between the two m's.
    */
    if (positions[START_POS]->len && counter == g_array_index (positions[START_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "<span background='yellow' foreground='black'>");
        // See the comment for the similar instruction above.
        g_array_remove_index (positions[START_POS], 0);
    }

    utf8_length = g_unichar_to_utf8 (unichar, utf8_char);
    /*
        Instead of using a switch statement to check whether the current character needs to be escaped,
        for simplicity the character is sent to the escape function regardless of whether there will be
        any escaping done by it or not.
    */
    utf8_escaped = g_markup_escape_text (utf8_char, utf8_length);

    txt_with_markup = g_string_append (txt_with_markup, utf8_escaped);

    // Cleanup
    g_free (utf8_escaped);

    counter_char = g_utf8_find_next_char (counter_char, NULL);
} while (*counter_char != '\0');

/*
    There is a '</span>' to set at the end; because the end position is one position after the string size
    this couldn't be done inside the preceding loop.
*/            
if (positions[END_POS]->len) {
    g_string_append (txt_with_markup, "</span>");
}

g_object_set (txt_renderer, "markup", txt_with_markup->str, NULL);

// Cleanup
g_string_free (txt_with_markup, TRUE); // the "markup" property keeps its own copy of the string
g_regex_unref (regex);
g_match_info_free (match_info);
g_array_free (positions[START_POS], TRUE);
g_array_free (positions[END_POS], TRUE);

Answer 1

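Probably the way to do this is not to use g_regex_replace_eval(), but rather to use g_regex_match_all() to get the list of matches for the string. Then you need to step through the string character by character (use the g_utf8_*() functions for this, since it has to be Unicode-aware). If you come across a character that needs escaping (<, >, &, ", '), output the escaped entity. When you reach a match position, output the correct markup for it.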
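The walk described above can be sketched in plain C, without GLib, so that it compiles on its own. This is only an illustration under stated assumptions: markup_matches, entity_for and append are hypothetical names, the match positions are assumed to be byte offsets (as g_match_info_fetch_pos reports them) that are sorted, non-empty and non-overlapping, and the loop steps byte by byte, which is safe here only because the five special characters are ASCII and never occur inside a multi-byte UTF-8 sequence. In a GTK program, GString and g_markup_escape_text would be the natural replacements for the manual buffer handling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: entity for c, or NULL if c needs no escaping. */
static const char *entity_for(char c)
{
    switch (c) {
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '&':  return "&amp;";
    case '"':  return "&quot;";
    case '\'': return "&apos;";
    default:   return NULL;
    }
}

/* Hypothetical helper: copy src to dst at offset *len (dst must be large enough). */
static void append(char *dst, size_t *len, const char *src)
{
    size_t n = strlen(src);
    memcpy(dst + *len, src, n);
    *len += n;
}

/*
 * Returns a malloc'd copy of text with span tags around each match and
 * special characters escaped everywhere. starts[k]/ends[k] are byte offsets
 * (end exclusive); matches must be sorted, non-empty and non-overlapping.
 * The caller frees the result.
 */
char *markup_matches(const char *text, const size_t *starts,
                     const size_t *ends, size_t n_matches)
{
    static const char open_tag[] = "<span background='yellow' foreground='black'>";
    static const char close_tag[] = "</span>";
    size_t text_len = strlen(text);
    size_t len = 0, next_start = 0, next_end = 0;
    /* Worst case: every byte becomes a 6-byte entity, plus both tags per match. */
    char *out = malloc(text_len * 6 + n_matches * (sizeof open_tag + sizeof close_tag) + 1);

    if (out == NULL)
        return NULL;
    for (size_t k = 0; k < n_matches; k++)
        assert(starts[k] < ends[k] && ends[k] <= text_len);

    for (size_t i = 0; ; i++) {
        /* Close before opening, so adjacent matches get separate spans. */
        if (next_end < n_matches && i == ends[next_end]) {
            append(out, &len, close_tag);
            next_end++;
        }
        if (next_start < n_matches && i == starts[next_start]) {
            append(out, &len, open_tag);
            next_start++;
        }
        if (i == text_len)
            break;  /* a match ending at the end of the string was closed above */

        const char *entity = entity_for(text[i]);
        if (entity != NULL)
            append(out, &len, entity);
        else
            out[len++] = text[i];
    }
    out[len] = '\0';
    return out;
}
```

Because the loop runs one step past the last byte, a match that ends at the end of the string is closed inside the loop and needs no special case afterwards.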

Answer 2

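I would first escape the whole text with g_markup_escape_text, and then also escape the search text before using it in g_regex_replace_eval. That way the escaped text is what gets matched, and the text outside the matches is already escaped.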
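A rough sketch of this escape-first idea, again in plain C so that it compiles without GLib. The assumptions: escape_markup is a hypothetical stand-in for g_markup_escape_text, highlight_escaped is a hypothetical name, and a literal strstr search stands in for the regex-based g_regex_replace_eval step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for g_markup_escape_text(): malloc'd escaped copy of s. */
static char *escape_markup(const char *s)
{
    char *out = malloc(strlen(s) * 6 + 1);  /* the longest entity is 6 bytes */
    size_t len = 0;

    if (out == NULL)
        return NULL;
    for (; *s != '\0'; s++) {
        const char *entity = NULL;
        switch (*s) {
        case '<':  entity = "&lt;";   break;
        case '>':  entity = "&gt;";   break;
        case '&':  entity = "&amp;";  break;
        case '"':  entity = "&quot;"; break;
        case '\'': entity = "&apos;"; break;
        }
        if (entity != NULL) {
            memcpy(out + len, entity, strlen(entity));
            len += strlen(entity);
        } else {
            out[len++] = *s;
        }
    }
    out[len] = '\0';
    return out;
}

/* Hypothetical helper: copy n bytes of src to dst at offset *len. */
static void append_n(char *dst, size_t *len, const char *src, size_t n)
{
    memcpy(dst + *len, src, n);
    *len += n;
}

/*
 * Escapes both text and needle, then wraps every literal occurrence of the
 * escaped needle in the escaped text with span tags. Caller frees the result.
 */
char *highlight_escaped(const char *text, const char *needle)
{
    static const char open_tag[] = "<span background='yellow' foreground='black'>";
    static const char close_tag[] = "</span>";
    char *esc_text = escape_markup(text);
    char *esc_needle = escape_markup(needle);
    char *out = NULL;

    if (esc_text != NULL && esc_needle != NULL) {
        size_t text_len = strlen(esc_text);
        size_t needle_len = strlen(esc_needle);
        /* Non-overlapping hits each consume needle_len bytes of the text. */
        size_t max_hits = needle_len ? text_len / needle_len : 0;

        out = malloc(text_len + max_hits * (strlen(open_tag) + strlen(close_tag)) + 1);
        if (out != NULL) {
            size_t len = 0;
            const char *pos = esc_text;
            const char *hit;

            while (needle_len && (hit = strstr(pos, esc_needle)) != NULL) {
                append_n(out, &len, pos, (size_t)(hit - pos));
                append_n(out, &len, open_tag, strlen(open_tag));
                append_n(out, &len, esc_needle, needle_len);
                append_n(out, &len, close_tag, strlen(close_tag));
                pos = hit + needle_len;
            }
            strcpy(out + len, pos);  /* remainder after the last hit */
        }
    }
    free(esc_text);
    free(esc_needle);
    return out;
}
```

Searching for "&" in "father & mother" works as intended with this sketch, because the escaped needle &amp; is found in the escaped text. One caveat worth checking before adopting the approach: a needle such as "m" also matches the m inside the entity &amp;, and wrapping that hit in a span breaks the entity.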

Tags: c, regex, gtk, glib, gtktreeview