Extract (from text) only non-repeating words using only regex via shell terminal

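I want to extract only the words that are not repeated in the text below, and I want to use nothing but a regex. I have seen similar questions, such as "Only extract those words from a list that include no repeating letters, using regex (don't repeat letters)" and "Regular Expression :match string containing only non repeating words". The result I am after is a list of the non-repeated words, in the natural order in which they appear in the text.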

My text in regular format:

Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied.

My text split word by word into a vertical list (in case it is easier to work with that way):

Answer

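If you need a pure regex solution, you can only do it with .NET or the Python PyPI regex module, because it requires two features that regex libraries usually lack: 1) right-to-left parsing of the input string and 2) infinite-width lookbehind.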
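To illustrate the second point, here is a small check against Python's built-in re module, which accepts fixed-width lookbehinds only. The helper name compiles_with_stdlib_re is made up for this illustration, and the two patterns are toy examples rather than the pattern used in the solution.

```python
import re


def compiles_with_stdlib_re(pattern):
    """Return True if the built-in re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A fixed-width lookbehind is accepted.
print(compiles_with_stdlib_re(r'(?<!ab)c'))    # True
# A lookbehind of unbounded width is rejected at compile time.
print(compiles_with_stdlib_re(r'(?<!a.*?)c'))  # False
```

This is why the solution relies on the third-party regex package instead of re.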

Here is a Python solution:

import regex  # third-party package from PyPI (pip install regex), not the built-in re module

text = "Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied."
# Scan right to left and keep a word only if the same whole word does not occur anywhere to its left.
rx = r'(?rus)(?<!\b\1\b.*?)\b(\w+)\b'
# Matches are found last-to-first, so reverse them to restore reading order.
print(list(reversed(regex.findall(rx, text))))

See an online demo.

Details

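  • (?rus) - r enables right-to-left parsing of the input string (every pattern inside the regex still matches left to right as usual, so the matched text is not reversed); u is used in Python 2 to make \w Unicode-aware and is the default in Python 3; s is the DOTALL modifier that makes . match line break characters
  • (?<!\b\1\b.*?) - a negative lookbehind that fails the match if, somewhere to the left of the current position, the same text as captured in Group 1 (defined later in the expression) occurs as a whole word, followed by any 0+ characters up to the current position
  • \b(\w+)\b - a whole word: 1 or more word characters enclosed in word boundaries, captured in Group 1.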

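reversed is used to print the words in their original order, because the right-to-left regex matches them from the end of the text to the start.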
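As a cross-check on what the pattern selects, here is a minimal sketch that uses only the standard library. It is not a pure regex solution, and the function names first_occurrences and words_occurring_once are hypothetical. As far as I can tell, the first function should give the same list as the regex above: every distinct word once, at its first position, compared case-sensitively. The second function covers the stricter reading of "non-repeating", where any word that appears more than once is dropped entirely.

```python
import re
from collections import Counter


def first_occurrences(text):
    """Return every distinct word once, in order of first appearance."""
    words = re.findall(r'\w+', text)
    # dict keeps insertion order (Python 3.7+), so this de-duplicates without reordering.
    return list(dict.fromkeys(words))


def words_occurring_once(text):
    """Return only the words that appear exactly once, in reading order."""
    words = re.findall(r'\w+', text)
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


sample = "the cat and the dog and The bird"
print(first_occurrences(sample))     # ['the', 'cat', 'and', 'dog', 'The', 'bird']
print(words_occurring_once(sample))  # ['cat', 'dog', 'The', 'bird']
```

Both helpers treat "The" and "the" as different words, just as the regex does without an ignore-case flag.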