Extract and split words from text and list them in order of occurrence using only shell terminal regex

Question

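I have the text below (in this format) and I want to split it into words and list them vertically, one per line, in the order they appear, as in this example. I tried egrep -vi "'?[^\p{L}']+'?|^'|'$" mytext.txt > output.txt but I got no result: output.txt is simply empty.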

My text:

Teaching psychology is the part of education psychology that refers to school education. As will be seen later, both have the same goal: study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied.

My Portuguese text:

A psicologia do ensino é a parte da psicologia da educacão que se refere à educacão escolar. Como se verá mais adiante, ambas têm um mesmo objetivo: estudar, explicar e compreender os processos de mudanca comportamental que se produzem nas pessoas como uma conseqüência da sua participacão em atividades educativas. O que confere uma entidade própria à psicologia do ensino é a natureza e as caracterís- ticas das atividades educativas que existem na base dos processos de mudanca comportamental estudados.

Answer 1

You may want to tokenize the text on whitespace. Any of these equivalent commands will do:

grep -o '[^[:space:]][^[:space:]]*' mytext.txt > output.txt
grep -o '[^[:space:]]\{1,\}' mytext.txt > output.txt
grep -oE '[^[:space:]]+' mytext.txt > output.txt
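
As a minimal sketch of the difference between the two approaches: -o prints every match on its own line, whereas the -v in the original attempt selects only lines that do not match. In an ERE, \p{L} is not a Unicode class, so [^\p{L}'] is just a bracket expression that excludes the characters \, p, {, L, } and '. Nearly every line of ordinary text matches it, so -v leaves nothing. The helper name split_on_whitespace and the sample string are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): one whitespace-separated token per line.
split_on_whitespace() {
  grep -oE '[^[:space:]]+'
}

sample='both have the same goal: study, explain'

# -o prints each match on its own line, in order of appearance.
# Punctuation stays attached to the tokens (e.g. "goal:" and "study,").
printf '%s\n' "$sample" | split_on_whitespace

# -v keeps only lines that do NOT match. The pattern from the question
# matches this line, so nothing is printed (grep exits non-zero, hence || true).
printf '%s\n' "$sample" | grep -Evi "'?[^\p{L}']+'?|^'|'\$" || true
```

Because punctuation is kept with this approach, it suits cases where tokens such as "goal:" should stay intact.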

Alternatively, you can use a PCRE regex to extract all chunks of one or more letters (\p{L}), diacritics (\p{M}), and digits (\p{N}), for example:

grep -oP '[\p{L}\p{M}\p{N}]+' mytext.txt > output.txt

See the online demo. On macOS, you will need pcregrep for this to work.
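
A small sketch of the PCRE variant on one sentence from the Portuguese sample. It assumes a GNU grep built with PCRE support (-P) and a UTF-8 locale, so that accented letters are treated as single characters. The helper name extract_words is an assumption for illustration only.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): one run of letters,
# combining marks and digits per line. Punctuation and whitespace are dropped.
# Assumes GNU grep with -P support and a UTF-8 locale.
extract_words() {
  grep -oP '[\p{L}\p{M}\p{N}]+'
}

printf '%s\n' 'Como se verá mais adiante, ambas têm um mesmo objetivo: estudar' | extract_words
```

Unlike the whitespace version, the comma after "adiante" and the colon after "objetivo" are not part of any match, so only the words themselves are listed.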
