sphinx-4 aligner skips plain words like `you`, `in` and words with dashes - why?

Question

I am trying to align a simple text. Here are the links to the text and the audio file:
http://s000.tinyupload.com/?file_id=48044768133759453374
http://s000.tinyupload.com/?file_id=99891199139563396901

The configuration settings are as follows:

private static final String ACOUSTIC_MODEL_PATH =
        "resource:/edu/cmu/sphinx/models/en-us/en-us";
private static final String DICTIONARY_PATH =
        "resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict";

The output I get is as follows (the ellipses are mine):

- ï
- ¿in
  a                         [11250:11330]
  standard                  [11330:11920]
  shopping                  [11920:12440]
  centre                    [12440:13020]
- you
  can                       [13380:13730]
  ...
  shops                     [15170:15790]
- you
  can                       [16620:16890]
  buy                       [16890:17140]
  ...
  and                       [26920:27230]
  suits                     [27190:27220]
- thereâ€™s
  a                         [29160:29210]
  sportswear                [29210:29980]
  ...
  clothes                   [33330:33360]
- t-shirts
  shorts                    [35560:36320]
  jumpers                   [36630:37410]
  ...
  for                       [41860:42010]

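As you can see, for some reason the aligner:

didn't recognize in before the first a
didn't recognize you
didn't recognize there's, instead it identified it as thereâ€™s
gave no timing for words with dashes, like t-shirts

Is there any way to configure sphinx so that it provides timings for these occurrences?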

Answer 1

A few comments:

didn't recognize in before the first a

Your text file starts with a BOM marker, which the aligner does not know about. It is best to remove it before alignment.
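One way to drop the BOM is to strip a leading U+FEFF right after reading the transcript. The sketch below assumes the file is UTF-8; `BomStripper` and its method names are hypothetical helpers, not part of sphinx4.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a sphinx4 class): removes a leading byte order mark.
final class BomStripper {
    // Java's UTF-8 decoder keeps the BOM and exposes it as this character.
    private static final char BOM = '\uFEFF';

    /** Returns the text without a leading U+FEFF, if one is present. */
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    /** Reads a transcript file (assumed to be UTF-8) and drops the BOM. */
    static String readTranscript(Path path) throws IOException {
        String raw = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        return stripBom(raw);
    }
}
```

The returned string can then be passed to the aligner in place of the raw file contents.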

didn't recognize there's, instead it identified it as thereâ€™s

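Your text uses UTF-8 (typographic) apostrophes, which the aligner does not know about. It is better to convert them to their ASCII equivalent.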
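A minimal sketch of that conversion, assuming the transcript has already been decoded correctly as UTF-8. `ApostropheNormalizer` is a hypothetical helper, and the set of characters it replaces is my own choice rather than anything sphinx4 prescribes.

```java
// Hypothetical helper (not a sphinx4 class): maps typographic apostrophes to ASCII.
final class ApostropheNormalizer {
    /** Replaces curly single quotes with the plain ASCII apostrophe. */
    static String toAsciiApostrophes(String text) {
        return text
                .replace('\u2019', '\'')  // right single quotation mark, as in "there's"
                .replace('\u2018', '\''); // left single quotation mark
    }
}
```

After this step `there’s` becomes `there's`, which is the spelling a dictionary lookup would expect.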

no timing for words with dashes, like t-shirts

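These words are not in the dictionary. You can either add them to the dictionary before alignment or specify a g2p model that converts them to phonemes.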
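To find out which words need to be added, you could compare the transcript against the dictionary before aligning. The sketch below assumes the dictionary uses the usual CMUdict line layout (`word PH1 PH2 ...`, with alternates written as `word(2)`); `DictionaryCheck` is a hypothetical helper, and its simple whitespace tokenization is an assumption that may not match what sphinx4 does internally.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a sphinx4 class): reports transcript words missing from a dictionary.
final class DictionaryCheck {
    /** Collects head words from lines of the assumed form "word PH1 PH2 ...". */
    static Set<String> headWords(List<String> dictLines) {
        Set<String> words = new HashSet<>();
        for (String line : dictLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String head = trimmed.split("\\s+", 2)[0];
            // Alternate pronunciations look like "word(2)"; drop the numeric suffix.
            head = head.replaceAll("\\(\\d+\\)$", "");
            words.add(head.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    /** Returns transcript words, in order of first appearance, that are not in the dictionary. */
    static List<String> missingWords(String transcript, Set<String> known) {
        Set<String> missing = new LinkedHashSet<>();
        for (String token : transcript.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Trim punctuation at the edges but keep inner apostrophes and dashes.
            String word = token.replaceAll("^[^a-z0-9']+|[^a-z0-9']+$", "");
            if (!word.isEmpty() && !known.contains(word)) {
                missing.add(word);
            }
        }
        return new ArrayList<>(missing);
    }
}
```

Each word this reports, such as `t-shirts`, would need either a dictionary entry of its own or a g2p model to generate its pronunciation.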

speech-recognition

sphinx4