What are the default regexps for `git diff --word-diff`?

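git diff has an option for matching words, --word-diff-regex=<...>. Some languages have special defaults (as described in man 5 gitattributes). But what are they? The documentation doesn't describe them, and I looked through git's source without finding them either.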

Any ideas?

Edit: I'm on git 1.9.1, but I'll accept an answer for any version.

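The sources contain the default word regexes in the userdiff.c file. The PATTERNS and IPATTERN macros take the basic word regex as their third parameter and append "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" to it. This makes sure that every non-whitespace character that isn't part of a larger word is treated as a word by itself and, assuming UTF-8, that multi-byte characters are not split. For example, in: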

PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
         "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),

the regular expression is "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+".

In this case, the |[\xc0-\xff][\x80-\xbf]+ happens not to do any good, since everything that [\xc0-\xff][\x80-\xbf]+ covers is already matched by [a-zA-Z0-9\x80-\xff]+, but it doesn't do any harm either.
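To get a feel for how such a combined regex splits a line into words, here is a rough sketch that uses grep rather than git itself. It uses the regex as the regex engine sees it, once the C string escapes are resolved. The high-byte ranges are left out to sidestep locale differences, so this is an ASCII-only simplification; the sample line and the variable names are made up for illustration.

```shell
# ASCII-only simplification of the combined tex word regex:
# the driver's own pattern plus the "any single non-space character" fallback.
word_regex='\\[a-zA-Z@]+|\\.|[a-zA-Z0-9]+|[^[:space:]]'

# grep -o prints each match on its own line, so each output line is one "word".
tokens=$(printf '%s\n' '\section{Intro} x+y' | LC_ALL=C grep -oE "$word_regex")
printf '%s\n' "$tokens"
```

The braces and the plus sign come out as one-character words only because of the appended fallback; the driver's own pattern never matches them.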

The docs for .gitattributes give a list of the predefined diff drivers (all of which have predefined word-diff regexes). They further state:

you still need to enable this with the attribute mechanism, via .gitattributes

So, to activate the tex pattern shown in hvd's answer for all *.tex files, you can issue the following command in your project root (omit the quotes under Windows):

echo '*.tex diff=tex' >> .gitattributes
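As a quick way to see the effect, here is a small sketch that builds a throwaway repository and runs the same word diff before and after the attribute is set. The file name demo.tex, its contents, and the commit identity are all made up for the demo; only the git commands themselves are real.

```shell
# Throwaway repository; file names and contents are made up for the demo.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
printf '%s\n' '\alpha+\beta' > demo.tex
git add demo.tex
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'initial'
printf '%s\n' '\alpha+\gamma' > demo.tex

# No attribute yet: words are whitespace-separated runs.
plain=$(git diff --word-diff demo.tex)

# Enable the built-in tex driver, then check that it is picked up.
echo '*.tex diff=tex' >> .gitattributes
attr=$(git check-attr diff -- demo.tex)
with_driver=$(git diff --word-diff demo.tex)

printf '%s\n' "$plain" "$attr" "$with_driver"
```

The first diff should mark the whole run \alpha+\beta as replaced, while the second should isolate the change from \beta to \gamma, since the tex word regex treats each macro as a word of its own.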

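Note: Git 2.34 (Q4 2021) is clearer about these patterns, and reminds developers that userdiff patterns should be kept simple and permissive, on the assumption that the contents they are applied to are always syntactically correct.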

See commit b6029b3 (10 Aug 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit e1eb133, 30 Aug 2021)

userdiff: comment on the builtin patterns

Remind developers that they do not need to go overboard to implement patterns to prepare for invalid constructs.
They only have to be sufficiently permissive, assuming that the payload is syntactically correct, and that may allow them to be simpler.

Text stolen mostly from, and further improved by, Johannes Sixt.

So those built-in patterns now come with a comment:

/*
 * Built-in drivers for various languages, sorted by their names
 * (except that the "default" is left at the end).
 *
 * When writing or updating patterns, assume that the contents these
 * patterns are applied to are syntactically correct.  The patterns
 * can be simple without implementing all syntactical corner cases, as
 * long as they are sufficiently permissive.
 */
static struct userdiff_driver builtin_drivers[] = {