Python: Using SSML with SAPI (comtypes)
TL;DR: I'm trying to pass an XML object (built with ET) to a comtypes (SAPI) object in Python 3.7.2 on Windows 10. It fails because of invalid characters (see the error below). The Unicode characters are read correctly from the file and can be printed (although they do not display correctly on the console). It looks as if the XML is being passed as ASCII, or am I missing a flag? (https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? (I haven't figured that part out yet.)
Detailed description
I'm using Python 3.7.2 on Windows 10 and trying to create an XML (SSML: https://www.w3.org/TR/speech-synthesis/) file to use with Microsoft's speech API. The voice struggles with certain words, and when I looked at the SSML format I saw that it supports a phoneme tag, which lets you specify how to pronounce a given word. Microsoft implements parts of the standard (https://docs.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element), so I found a UTF-8 encoded library containing IPA pronunciations. When I call SAPI with the words replaced by phoneme tags, I get the following error:
Traceback (most recent call last):
  File "pdf_to_speech.py", line 132, in <module>
    audioConverter(text = "Hello world extended test",outputFile = output_file)
  File "pdf_to_speech.py", line 88, in __call__
    self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))
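I've been trying to debug this, but when I print the pronunciation of each word, the characters show up as boxes. If I copy and paste them from my console, though, they look fine (see below).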
həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst
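When the console renders IPA as boxes, it can help to look at the code points instead of the glyphs. The following is a minimal debugging sketch, not part of the original script; `describe_chars` is a hypothetical helper name. It lists every character of a pronunciation string with its code point and Unicode name, which also makes stray punctuation easy to spot.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for every character in text."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    for ch, code_point, name in describe_chars("həˈloʊ,"):
        # ascii() keeps the output printable on consoles that cannot render IPA
        print(ascii(ch), code_point, name)
```

If the names that come back are the expected IPA letters, the file was decoded correctly and the boxes are only a console font issue.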
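Best guess
I'm not sure whether the problem is caused by one of the following:
1) I changed Python versions to be able to print Unicode
2) I fixed problems with reading the file
3) I'm manipulating the string incorrectly
I'm fairly sure the problem is that I'm not passing it to the comtypes object as Unicode. The ideas I'm looking into are:
1) Is a flag missing?
2) Is it being converted to ASCII when it is passed to comtypes (a C types error)?
3) Is the XML being passed incorrectly, or am I missing a step?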
Sneak peek at the code
This is the class that reads the IPA dictionary and then generates the XML. Look at _load_phonemes and _pronounce.
import io
import sys
import xml.etree.ElementTree as ET


class SSML_Generator:
    def __init__(self,pause,phonemeFile):
        self.pause = pause
        if isinstance(phonemeFile,str):
            print("Loading dictionary")
            self.phonemeDict = self._load_phonemes(phonemeFile)
            print(len(self.phonemeDict))
        else:
            self.phonemeDict = {}

    def _load_phonemes(self, phonemeFile):
        # Each line of the dictionary file: <word> <IPA pronunciation>
        phonemeDict = {}
        with io.open(phonemeFile, 'r',encoding='utf-8') as f:
            for line in f:
                tok = line.split()
                #print(len(tok))
                phonemeDict[tok[0].lower()] = tok[1].lower()
        return phonemeDict

    def __call__(self,text):
        SSML_document = self._header()
        for utterance in text:
            parent_tag = self._pronounce(utterance,SSML_document)
            #parent_tag.tail = self._pause(parent_tag)
            SSML_document.append(parent_tag)
        ET.dump(SSML_document)
        return SSML_document

    def _pause(self,parent_tag):
        return ET.fromstring("<break time=\"150ms\" />") # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})

    def _header(self):
        return ET.Element("speak",{"version":"1.0", "xmlns":"http://www.w3.org/2001/10/synthesis", "xml:lang":"en-US"})

    # TODO: Add rate https://docs.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
    def _rate(self):
        pass

    # TODO: Add pitch
    def _pitch(self):
        pass

    def _pronounce(self,word,parent_tag):
        if word in self.phonemeDict:
            # Write the raw UTF-8 bytes so the console encoding does not get in the way
            sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
            return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")#ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})#<phoneme alphabet="string" ph="string"></phoneme>
        else:
            return parent_tag

    # Nice to have: Transform acronyms into their pronunciation (See say as tag)
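For comparison, here is a minimal sketch of the ET.SubElement route that is commented out in _pronounce. It is not the code that produced the error; `build_ssml` and its arguments are hypothetical names, and the dictionary is assumed to map lower-case words to IPA strings. Passing `ph` as an attribute value lets ElementTree handle quoting and escaping, and the spoken word is kept as the element text.

```python
import xml.etree.ElementTree as ET


def build_ssml(words, phoneme_dict):
    """Build an SSML string, wrapping words found in phoneme_dict in <phoneme>."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    last = None  # most recently added child element, if any
    for word in words:
        ph = phoneme_dict.get(word.lower())
        if ph is not None:
            # ElementTree escapes the attribute value when serializing
            last = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ph})
            last.text = word
            last.tail = " "
        elif last is None:
            # Plain text before the first child goes into speak.text
            speak.text = (speak.text or "") + word + " "
        else:
            # Plain text after a child goes into that child's tail
            last.tail = (last.tail or "") + word + " "
    # encoding="unicode" returns a str, so no bytes/decode round trip is needed
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(ascii(build_ssml(["Hello", "world"], {"hello": "həˈloʊ"})))
```

Unknown words are emitted as plain text rather than being appended as elements, so the root element is never appended to itself.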
I've also included the code that writes to the comtypes object (SAPI), in case the error is there.
def __call__(self,text,outputFile):
    # https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
    self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
    self.engine.AudioOutputStream = self.stream
    text = self._text_processing(text)
    text = self.SSML_generator(text)
    text = ET.tostring(text,encoding='utf8', method='xml').decode('utf-8')
    self.engine.speak(text)
    self.stream.Close()
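On the question of how a flag would be passed: the Speak method of SpVoice takes an optional second argument built from the SpeechVoiceSpeakFlags enumeration. The sketch below is an assumption-laden illustration, not a confirmed fix; `speak_ssml` is a hypothetical helper and the two constants are the SVSFIsXML and SVSFParseSsml values from that enumeration. Since the error message already names the phoneme element, the markup was probably parsed as SSML anyway, so the flags may not change the outcome.

```python
# Values from the SAPI 5 SpeechVoiceSpeakFlags enumeration
SVSF_IS_XML = 8         # SVSFIsXML: the input text contains XML markup
SVSF_PARSE_SSML = 256   # SVSFParseSsml: parse the markup as W3C SSML


def speak_ssml(voice, ssml):
    """Call voice.Speak with the XML and SSML flags set explicitly.

    `voice` is assumed to be a SAPI SpVoice automation object; any object
    with a compatible Speak(text, flags) method works for testing.
    """
    flags = SVSF_IS_XML | SVSF_PARSE_SSML
    return voice.Speak(ssml, flags)
```

On Windows the voice object would typically come from comtypes.client.CreateObject("SAPI.SpVoice"). Python 3 str values are passed to COM as wide (UTF-16) strings, so an ASCII conversion at this boundary seems unlikely.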
Thanks in advance for your help!
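Try using single quotes around the value of the ph attribute.
Like this:
my_text = '<speak><phoneme alphabet="x-sampa" ph=\'v"e.de.ni.e\'>ведение</phoneme></speak>'
Also remember to escape the single quotes with \ when the Python string itself is delimited by single quotes.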
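If you would rather not pick the quote character by hand, the standard library's xml.sax.saxutils.quoteattr does it for you. This is a small sketch of that idea; `phoneme_tag` is a hypothetical helper name, and the x-sampa string is the one from the example above.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word, ph, alphabet="x-sampa"):
    """Return a <phoneme> element as a string with a safely quoted ph value."""
    # quoteattr adds the surrounding quotes itself: it uses single quotes
    # when the value contains a double quote, and escapes &, < and >
    return "<phoneme alphabet=%s ph=%s>%s</phoneme>" % (
        quoteattr(alphabet),
        quoteattr(ph),
        escape(word),
    )


if __name__ == "__main__":
    print(ascii(phoneme_tag("ведение", 'v"e.de.ni.e')))
```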
UPD
This error can also mean that your ph value could not be parsed. You can check the documentation here: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup
This example works:
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-Jessa24kRUS">
    <s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
  </voice>
</speak>
But this one does not:
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-Jessa24kRUS">
    <s>His name is Mike <phoneme alphabet="ups" ph="JHU AUA"> Zhou </phoneme></s>
  </voice>
</speak>
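Some of the pronunciations printed in the question carry a trailing comma (for example "həˈloʊ,"), which would be one way for a ph value to become unparseable. Whether that is the actual cause is a guess. A minimal sketch of cleaning dictionary entries before they go into the ph attribute, with `clean_ph` as a hypothetical helper name:

```python
def clean_ph(ph):
    """Strip surrounding whitespace, commas and semicolons from a pronunciation."""
    # Only separators are removed; characters such as the IPA stress mark
    # (U+02C8) or the x-sampa double quote are left alone
    return ph.strip(" \t\r\n,;")


if __name__ == "__main__":
    for raw in ["həˈloʊ,", "ˈwɝːld", "ɪkˈstɛndəd,", "ˈtɛst"]:
        print(ascii(clean_ph(raw)))
```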