Enclose a Greek sentence in tags, or find and replace the part of a sentence that contains Greek characters

Question

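I am trying to find a way to enclose sentences written in Greek characters in a special tag (LaTeX in this case, but that is beside the point). So given my input text: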

inputtext = "some english text ῍Ενθεσις τοῦ Ψαλτής and then english again"

I would like to get this:

results = "some english text \textgreek{῍Ενθεσις τοῦ Ψαλτής} and then english again"

After a few hours I came up with this solution, which almost works:

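import re

inputtext = "some english text ῍Ενθεσις τοῦ Ψαλτής and then english again"
# Collect every character matched by the basic Greek letter class
t = re.findall('[α-ωΑ-Ω]', inputtext)
beg = inputtext.find(t[0])        # first occurrence of the first matched letter
end = inputtext.rfind(t[-1]) + 1  # one past the last occurrence of the last matched letter
# The backslash is doubled because "\t" in a normal string literal is a tab
results = inputtext[:beg] + "\\textgreek{" + inputtext[beg:end] + "}" + inputtext[end:]

In [50]: results
Out[50]: 'some english text ῍\\textgreek{Ενθεσις τοῦ Ψαλτής} and then english again'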

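That brings me to the actual question: is there a better solution, perhaps one that uses only a regular expression? The current solution seems to skip the Greek polytonic character ῍, and of course it only works when each sentence contains a single Greek passage.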
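For illustration, here is a small sketch of why the class `[α-ωΑ-Ω]` skips some of these characters: it covers only U+03B1 to U+03C9 and U+0391 to U+03A9, whereas polytonic characters such as ῍ live in the Greek Extended block (U+1F00 to U+1FFF). The helper name `missed_by_basic_class` is hypothetical, and the class is written with escapes only so that the code points are unambiguous.

```python
import re
import unicodedata

# Same class as '[α-ωΑ-Ω]', spelled with escapes to make the code points explicit
BASIC_CLASS = re.compile("[\u03b1-\u03c9\u0391-\u03a9]")


def missed_by_basic_class(text):
    """Return (char, code point, Unicode name) for each non-ASCII,
    non-space character that the basic class does not match."""
    missed = []
    for ch in text:
        if ch.isspace() or ord(ch) < 128:
            continue  # skip whitespace and plain ASCII (the English text)
        if not BASIC_CLASS.fullmatch(ch):
            missed.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return missed


# Usage: list the characters of the Greek passage that the class skips
for ch, code_point, name in missed_by_basic_class("῍Ενθεσις τοῦ Ψαλτής"):
    print(code_point, name)
```

Any character reported here falls outside the two ranges, which is why `findall` never returns it and the computed `beg` lands one character too late.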

Answer 1

Using the regex module:

>>> import regex
>>> s = "some english text ῍Ενθεσις τοῦ Ψαλτής and then english again"
>>> regex.sub(r'\p{Greek}+(\s+\p{Greek}+)*', r'\\textgreek{\g<0>}', s)
'some english text \\textgreek{῍Ενθεσις τοῦ Ψαλτής} and then english again'

This is based on the given example; I am not sure how you want to handle characters such as non-Greek punctuation.

If the Greek_and_Coptic and Greek_Extended blocks contain all the characters you want to match, you can build the character ranges by hand and use the re module itself:

>>> import re
>>> s = "some english text ῍Ενθεσις τοῦ Ψαλτής and then english again"
>>> re.sub(r'[\u0370-\u03ff\u1f00-\u1fff]+(\s+[\u0370-\u03ff\u1f00-\u1fff]+)*', r'\\textgreek{\g<0>}', s)
'some english text \\textgreek{῍Ενθεσις τοῦ Ψαλτής} and then english again'
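The same idea can be wrapped in a small reusable function. This is only a sketch under a few assumptions: the names `wrap_greek`, `macro` and `joiners` are made up for illustration, the two block ranges are assumed to cover every character you care about, and the input is assumed to use precomposed characters (combining accents fall outside these ranges). A function is used as the replacement so that no escape processing is applied to the replacement text, where `\t` in a template would be read as a tab.

```python
import re

# Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF)
GREEK = r"[\u0370-\u03ff\u1f00-\u1fff]"


def wrap_greek(text, macro="textgreek", joiners=r"\s"):
    """Wrap every run of Greek words in \\macro{...}.

    `joiners` is the content of a character class listing what may sit
    between two Greek words inside one run (whitespace by default).
    """
    pattern = re.compile(rf"{GREEK}+(?:[{joiners}]+{GREEK}+)*")
    # The callable builds the replacement literally, backslash included
    return pattern.sub(lambda m: f"\\{macro}{{{m.group(0)}}}", text)


s = "some english text ῍Ενθεσις τοῦ Ψαλτής and then english again"
print(wrap_greek(s))
# Allow commas between Greek words to stay inside a single run
print(wrap_greek("first αβγ, δεζ then ηθι", joiners=r"\s,"))
```

Because `re.sub` replaces every non-overlapping match, each separate Greek passage in the text gets its own tag. Joiner characters are absorbed only when Greek text follows them, so trailing punctuation stays outside the braces.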

python

regex

unicode