sphinx-4 aligner skips plain words like `you`, `in` and words with dashes - why?
I am trying to align a simple text. Here are the links to the text and audio files:
http://s000.tinyupload.com/?file_id=48044768133759453374
http://s000.tinyupload.com/?file_id=99891199139563396901
The configuration settings are as follows:
private static final String ACOUSTIC_MODEL_PATH =
"resource:/edu/cmu/sphinx/models/en-us/en-us";
private static final String DICTIONARY_PATH =
"resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict";
The output I get is as follows (the ellipses are mine):
- ï»¿in
a [11250:11330]
standard [11330:11920]
shopping [11920:12440]
centre [12440:13020]
- you
can [13380:13730]
...
shops [15170:15790]
- you
can [16620:16890]
buy [16890:17140]
...
and [26920:27230]
suits [27190:27220]
- there’s
a [29160:29210]
sportswear [29210:29980]
...
clothes [33330:33360]
- t-shirts
shorts [35560:36320]
jumpers [36630:37410]
...
for [41860:42010]
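As you can see, for some reason it:
- didn't recognize `in` before the first `a`
- gave no timing for multiple instances of `you`
- didn't recognize `there's`, instead it identified it as `there’s`
- gave no timing for words with dashes, like `t-shirts`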
Is there any way to configure sphinx to provide timings for these occurrences?
A few comments:
didn't recognize in before the first a
Your text file starts with a BOM (byte order mark), which the aligner does not recognize. It is better to remove it before alignment.
didn't recognize there's, instead it identified it as there’s
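Your text uses a UTF-8 (typographic) apostrophe, which the aligner does not recognize. You are better off converting such characters to their ASCII equivalents.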
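As a rough illustration of the two points above, the transcript can be cleaned before it is handed to the aligner. This is only a sketch of one possible approach: it assumes the file has already been decoded as UTF-8, so the BOM shows up as the single character U+FEFF, and the names `TranscriptCleaner` and `clean` are made up for this example and are not part of sphinx4.

```java
final class TranscriptCleaner {
    private TranscriptCleaner() {}

    /** Strips a leading BOM and maps typographic apostrophes to the ASCII one. */
    static String clean(String text) {
        // A UTF-8 BOM decodes to the single character U+FEFF at the start of the text.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        // U+2018 and U+2019 are the curly single quotes; the dictionary entries use ASCII '.
        return text.replace('\u2018', '\'').replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        // Sample input modelled on the transcript in the question.
        String raw = "\uFEFFin a standard shopping centre there\u2019s a sportswear shop";
        System.out.println(clean(raw));
    }
}
```

The cleaned string, rather than the raw file contents, would then be the transcript passed to the aligner.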
no timing for words with dashes, like t-shirts
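These words are not in the dictionary. You can either add them to the dictionary before alignment, or specify a g2p model to convert them to phonemes.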
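To find out in advance which words would be skipped for this reason, one option is to compare the transcript against the dictionary file. The following is a minimal sketch, assuming the dictionary is plain text with one entry per line, the word first and its phonemes after it, and alternate pronunciations written as `word(2)`. The tokenization rule and the names `OovFinder`, `dictionaryWords` and `missingWords` are assumptions for illustration, not sphinx4 behavior, and the sample dictionary lines in `main` are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class OovFinder {
    private OovFinder() {}

    /** Collects the head word of every dictionary line, ignoring "(2)"-style suffixes. */
    static Set<String> dictionaryWords(List<String> dictLines) {
        Set<String> words = new HashSet<>();
        for (String line : dictLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // The first whitespace-separated token is the word; the rest are phonemes.
            String head = trimmed.split("\\s+", 2)[0];
            int paren = head.indexOf('(');
            if (paren > 0) {
                head = head.substring(0, paren); // "can(2)" -> "can"
            }
            words.add(head.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    /** Returns transcript words that are absent from the dictionary, in order of appearance. */
    static Set<String> missingWords(String transcript, Set<String> known) {
        Set<String> missing = new LinkedHashSet<>();
        // Assumed tokenization: keep letters, apostrophes and hyphens inside a word.
        for (String token : transcript.toLowerCase(Locale.ROOT).split("[^a-z'\\-]+")) {
            if (!token.isEmpty() && !known.contains(token)) {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative dictionary lines only; a real run would read the .dict file.
        List<String> dict = Arrays.asList("you Y UW", "can K AE N", "can(2) K AH N", "shorts SH AO R T S");
        Set<String> known = dictionaryWords(dict);
        System.out.println(missingWords("You can buy t-shirts, shorts", known));
    }
}
```

Any word reported here would need either a dictionary entry or a g2p model before the aligner can time it.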