Finding Matching Strings Within Paragraphs
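I have a TXT file containing LaTeX math equations, where each inline equation is delimited by a single $ before and after it.
I want to find every equation within a paragraph and replace the delimiters with XML opening and closing tags.
For example,
the following paragraph: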
This is the beginning of a paragraph $first equation$ ...and here is some text... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
should become:
This is the beginning of a paragraph <equation>first equation</equation> ...and here is some text... <equation>second equation</equation> ...and here is more text... <equation>third equation</equation> ...and here is yet more text... <equation>fourth equation</equation>
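I have tried sed and perl commands such as the following:
perl -p -e 's/(\$)(.*[^\$])(\$)/<equation>$2<\/equation>/'
But these commands convert only the first and last equation delimiters, and none of the equations in between: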
This is the beginning of a paragraph <equation>first equation$ ...and here is some text... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation</equation>
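The result above is what a greedy `.*` produces: it runs to the last $ on the line, so only the outermost pair of delimiters is tagged. A minimal sketch of the difference, using sed and a shortened, made-up sample line (the pattern is a simplified stand-in for the perl attempt, not the exact command):

```shell
#!/bin/sh
# Hypothetical sample line, modeled on the paragraph in the question.
line='start $first equation$ middle $second equation$ end'

# Greedy: .* extends to the LAST $ on the line, so only the outermost pair is tagged.
greedy=$(printf '%s\n' "$line" | sed -E 's|\$(.*)\$|<equation>\1</equation>|')

# Negated class: [^$]* cannot cross a $, so each pair is matched on its own.
paired=$(printf '%s\n' "$line" | sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g')

printf '%s\n' "$greedy"
# prints: start <equation>first equation$ middle $second equation</equation> end
printf '%s\n' "$paired"
# prints: start <equation>first equation</equation> middle <equation>second equation</equation> end
```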
I would also like a robust solution that can cope with single $ characters that are not used as LaTeX delimiters. For example,
This is the beginning of a paragraph $first equation$ ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
should not become:
This is the beginning of a paragraph <equation>first equation$ ...and here is some text that includes a single dollar sign: He paid <equation>2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation</equation>
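Note: I am writing this in Bash.
Note: the first part of this answer deals only with replacing pairs of $'s; for the OP's request to not replace standalone $'s, see the second part of the answer.
Replacing pairs of $'s
Sample data: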
$ cat latex.txt
... $first equation$ ... $second equation$ ... $third equation$
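One sed idea:
sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g' latex.txt
Where:
-E - enables extended regular expression support
\$ - matches a literal $
([^$]*) - [capture group #1] matches everything that is not a literal $ (in this case, everything between the pair of $'s)
\$ - matches a literal $
<equation>\1</equation> - replaces the matched string with <equation> + contents of capture group #1 + </equation>
g - repeats the search/replace as often as needed
This generates: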
... <equation>first equation</equation> ... <equation>second equation</equation> ... <equation>third equation</equation>
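One property of this pairing approach is worth seeing on a small case: sed pairs dollar signs strictly left to right, so a single stray $ shifts every pairing after it. A short sketch, using a made-up line that is not part of the sample file:

```shell
#!/bin/sh
# Hypothetical line with a stray price in the middle (not from the sample file).
line='a $x$ b $3.50 c $y$'

# The same pairing substitution: $...$ becomes <equation>...</equation>.
out=$(printf '%s\n' "$line" | sed -E 's|\$([^$]*)\$|<equation>\1</equation>|g')

printf '%s\n' "$out"
# prints: a <equation>x</equation> b <equation>3.50 c </equation>y$
```

The price is swallowed into an equation and the real second equation loses its opening delimiter, which is why standalone $'s need separate treatment.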
Dealing with standalone $
If the standalone $ can be escaped (e.g., \$), one idea is to have sed replace it with a nonsense literal, perform the <equation> / </equation> replacement, and then change the nonsense literal back to \$.
Sample data:
$ cat latex.txt
... $first equation$ ... $second equation$ ... $third equation$
... $first equation$ ... \$3.50 cup of coffee ... $third equation$
The original sed solution with the new replacements:
sed -E 's|\\\$|LITDOL|g;s|\$([^$]*)\$|<equation>\1</equation>|g;s|LITDOL|\\$|g' latex.txt
We replace \$ with LITDOL (LITeral DOLlar), perform our original replacement, and then switch LITDOL back to \$.
This generates:
... <equation>first equation</equation> ... <equation>second equation</equation> ... <equation>third equation</equation>
... <equation>first equation</equation> ... \$3.50 cup of coffee ... <equation>third equation</equation>
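The placeholder trick relies on the placeholder never occurring in the input; otherwise the final step would turn unrelated text into \$. A minimal sketch of a guard around the three substitutions, assuming standalone dollars are written as \$ (the function name tag_equations is hypothetical, and LITDOL is simply the token used in this answer):

```shell
#!/bin/sh
# tag_equations: hypothetical helper; reads text on stdin, writes tagged text to stdout.
tag_equations() {
    input=$(cat)
    # Refuse to run if the placeholder is already present, because the last
    # substitution would rewrite those occurrences to \$ as well.
    if printf '%s\n' "$input" | grep -qF 'LITDOL'; then
        echo 'error: input already contains LITDOL' >&2
        return 1
    fi
    # 1) hide escaped dollars, 2) tag $...$ pairs, 3) restore escaped dollars.
    printf '%s\n' "$input" |
        sed -E 's|\\\$|LITDOL|g;s|\$([^$]*)\$|<equation>\1</equation>|g;s|LITDOL|\\$|g'
}

printf '%s\n' '... $first equation$ ... \$3.50 cup of coffee ... $third equation$' | tag_equations
# prints: ... <equation>first equation</equation> ... \$3.50 cup of coffee ... <equation>third equation</equation>
```

If LITDOL could plausibly appear in real text, any other token that cannot occur in the file would serve equally well.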
Try this Perl, which uses a negative lookahead.
$ cat joseph.txt
This is the beginning of a paragraph $first equation$ ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... $second equation$ ...and here is more text... $third equation$ ...and here is yet more text... $fourth equation$
$ perl -p -e 's/(\$)(?![\d.]+)(.+?)(\$)/<equation>$2<\/equation>/g' joseph.txt
This is the beginning of a paragraph <equation>first equation</equation> ...and here is some text that includes a single dollar sign: He paid $2.50 for a pack of cigarettes... <equation>second equation</equation> ...and here is more text... <equation>third equation</equation> ...and here is yet more text... <equation>fourth equation</equation>
$