Regex: match underscore-wrapped words unless they start with @ / #

Question

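I'm trying to fix this bug in Tiptap (a WYSIWYG editor for Vue) by passing in a custom regex, so that the regex that recognizes italic notation in Markdown (_value_) is not applied to strings that start with @ or #. For example, #some_tag_value should not be turned into #sometagvalue (with "tag" in italics).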

Here is my regex so far - /(^|[^@#_\w])(?:\w?)(_([^_]+)_)/g

Edit: new regex, with help from @Wiktor Stribiżew - /(^|[^@#_\w])(_([^_]+)_)/g

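While it handles most common cases, it still fails when the underscores sit in the middle of a word, e.g. ant_farm_ should match (rendering as antfarm, with "farm" in italics).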

I've also put some "should match" and "should not match" cases at https://regexr.com/50ibf for easy testing.

Should match (between underscores)

_italic text here_
police_woman_
_fire_fighter
a thousand _words_
_brunch_ on a Sunday

Should not match

@ta_g_
__value__
#some_tag_value
@some_value_here
@some_tag_
#some_val_
#_hello_
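
The two lists above can be turned into a small harness for trying out candidate patterns. This is only a sketch: failingCases is a hypothetical helper name, and each case is tested as its own string rather than as part of one multi-line document. Run against the edited pattern from the question, it reports the mid-word case as the only one missed.

```javascript
const shouldMatch = [
  '_italic text here_',
  'police_woman_',
  '_fire_fighter',
  'a thousand _words_',
  '_brunch_ on a Sunday',
];

const shouldNotMatch = [
  '@ta_g_',
  '__value__',
  '#some_tag_value',
  '@some_value_here',
  '@some_tag_',
  '#some_val_',
  '#_hello_',
];

// Hypothetical helper: lists the cases on which `re` gives the wrong verdict.
// search() is used instead of test() because it is not affected by the
// lastIndex state that a /g regex carries between calls.
function failingCases(re) {
  const missed = shouldMatch.filter((line) => line.search(re) === -1);
  const falsePositives = shouldNotMatch.filter((line) => line.search(re) !== -1);
  return { missed, falsePositives };
}

// The edited pattern from the question.
console.log(failingCases(/(^|[^@#_\w])(_([^_]+)_)/g));
// → { missed: [ 'police_woman_' ], falsePositives: [] }
```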

Answer 1

For science, this monstrosity works in Chrome (and Node.js).

let text = `
<strong>Should match</strong> (between underscores)

_italic text here_
police_woman_
_fire_fighter
a thousand _words_
_brunch_ on a Sunday

<strong>Should not match</strong>

@ta_g_
__value__
#some_tag_value
@some_value_here
@some_tag_
#some_val_
#_hello_
`;

let re = /(?<=(?:\s|^)(?![@#])[^_\n]*)_([^_]+)_/g;
document.querySelector('div').innerHTML = text.replace(re, '<em>$1</em>');

div { white-space: pre; }

<div/>

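This captures _something_ as the full match and something as the first capture group (so that the underscores can be removed). You can't capture just something, because then you lose the ability to tell what is inside the underscores from what is outside them (try it with (?<=(?:\s|^)(?![@#])[^_\n]*_)([^_]+)(?=_)).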

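Two things keep it from being universally applicable:

Lookbehinds are not supported in all JavaScript engines
Most regex engines do not support variable-length lookbehinds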
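
Given those two limitations, one option is to check at runtime whether the engine can compile a variable-length lookbehind before relying on it. The following is a minimal sketch; supportsVariableLengthLookbehind and italicPattern are hypothetical names, and what to fall back on when the check fails is left to the caller.

```javascript
// Returns true when this engine can compile and run a variable-length
// lookbehind. The pattern is built from a string so that an unsupported
// engine throws a catchable SyntaxError at runtime instead of rejecting
// the whole script at parse time.
function supportsVariableLengthLookbehind() {
  try {
    return new RegExp('(?<=a+)b').test('aab');
  } catch (err) {
    return false;
  }
}

// Build the answer's pattern only when the engine can compile it.
// A null value signals that a lookbehind-free pattern is needed instead.
const italicPattern = supportsVariableLengthLookbehind()
  ? new RegExp('(?<=(?:\\s|^)(?![@#])[^_\\n]*)_([^_]+)_', 'g')
  : null;

console.log(supportsVariableLengthLookbehind());
```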

Edit: This one is a bit more robust, and should additionally let you correctly match match_this_and_that_ but not @match_this_and_that:

/(?<=(?:\s|^)(?![@#])(?!__)\S*)_([^_]+)_/

Explanation:

_([^_]+)_    Match non-underscory bit between two underscores
(?<=...)     that is preceded by
(?:\s|^)     either a whitespace or a start of a line/string
             (i.e. a proper word boundary, since we can't use `\b`)
\S*          and then some non-space characters
(?![@#])     that don't start with `@`, `#`,
(?!__)       or `__`.
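
As a quick check of the breakdown above, the pattern can be applied to a few of the question's cases. This sketch assumes an engine with lookbehind support (such as Chrome or Node.js, as noted earlier), adds the g flag so that every match on a line is replaced, and uses a hypothetical helper name, renderItalics.

```javascript
// Improved pattern from the answer, with the g flag added.
const italicRe = /(?<=(?:\s|^)(?![@#])(?!__)\S*)_([^_]+)_/g;

// Hypothetical helper: wrap each matched span in <em> and drop the underscores.
function renderItalics(line) {
  return line.replace(italicRe, '<em>$1</em>');
}

console.log(renderItalics('police_woman_'));        // police<em>woman</em>
console.log(renderItalics('match_this_and_that_')); // match<em>this</em>and<em>that</em>
console.log(renderItalics('@match_this_and_that')); // unchanged
console.log(renderItalics('__value__'));            // unchanged
console.log(renderItalics('#_hello_'));             // unchanged
```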

regex101 demo

Answer 2

You may use the following pattern:

(?:^|\s)[^@#\s_]*(_([^_]+)_)

See the regex demo.

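Details

(?:^|\s) - start of the string or a whitespace character
[^@#\s_]* - 0 or more characters other than @, #, _ and whitespace
(_([^_]+)_) - Group 1: _, then 1+ characters other than _ (captured into Group 2), then _.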
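
Because this pattern consumes the text in front of the underscores rather than looking behind for it, a replacement has to put that prefix back. The sketch below shows one way to do that with a replacer function. The helper name emphasize is hypothetical, the g flag is an addition, and each case is passed in as its own string.

```javascript
// Pattern from the answer, with the g flag added.
const pattern = /(?:^|\s)[^@#\s_]*(_([^_]+)_)/g;

// Hypothetical helper. Group 1 (the _..._ span) always sits at the end of
// the match, so everything before it is the consumed prefix; it is copied
// through unchanged and only the span itself is rewritten.
function emphasize(text) {
  return text.replace(pattern, (match, wrapped, inner) => {
    const prefix = match.slice(0, match.length - wrapped.length);
    return `${prefix}<em>${inner}</em>`;
  });
}

console.log(emphasize('a thousand _words_')); // a thousand <em>words</em>
console.log(emphasize('police_woman_'));      // police<em>woman</em>
console.log(emphasize('#some_tag_value'));    // unchanged
```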

Answer 3

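Here's something that isn't as compact as the other answers, but I think it makes it easier to see what's going on. The match group you want is the innermost one, ([a-zA-Z\s]+), which is group 3 in this pattern. The multiline flag is required.

^([a-zA-Z\s]+|_)(([a-zA-Z\s]+)_)+?[a-zA-Z\s]*?$

^ - matches the start of the line
([a-zA-Z\s]+|_) - one or more words, or a single _
(([a-zA-Z\s]+)_)+? - words followed by _, at least once, matching as few times as possible
[a-zA-Z\s]*? - any trailing words
$ - end of the line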

In summary, the breakdown above matches a line in one of these forms:

_<words>_
<words>_<words>_
<words>_<words>_<words>
_<words>_<words>
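
To see which text this pattern extracts, it can be run line by line. This is a sketch under two assumptions: the group of interest is the innermost one (index 3), and each line is tested as its own string, since \s also matches newlines and could otherwise let a match span several lines. The helper name italicPart is hypothetical.

```javascript
// Pattern from the answer, with the multiline flag it asks for.
const lineRe = /^([a-zA-Z\s]+|_)(([a-zA-Z\s]+)_)+?[a-zA-Z\s]*?$/m;

// Hypothetical helper: returns the text captured by group 3, or null when
// the line does not match at all.
function italicPart(line) {
  const match = lineRe.exec(line);
  return match ? match[3] : null;
}

console.log(italicPart('_italic text here_')); // italic text here
console.log(italicPart('police_woman_'));      // woman
console.log(italicPart('_fire_fighter'));      // fire
console.log(italicPart('#_hello_'));           // null
```

Since the character classes only allow letters and whitespace, any line containing digits, punctuation, @ or # is rejected outright rather than partially matched.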


javascript

regex

markdown

tiptap