扩展 Kaldi Aspire:使用新词典和语法文件重新编译 HCLG.fst 时出现变量错误

Extending Kaldi Aspire: bad variable error while Recompiling HCLG.fst using new lexicon and grammar files

我已经在 WSL 上成功设置并 运行 Kaldi Aspire 配方。现在我正在研究一个 POC,我想通过制作一个新的语料库、字典、语言模型并将其与原始 HCLG.fst 合并来扩展 ASPIRE 配方。我关注了这个blog post。我已经能够成功创建新词典、语言模型并合并输入文件。但是,当我尝试使用新的 Lexicon 和语法重新编译 HCLG.fst 时出现以下错误。

Checking update-model/local/dict/silence_phones.txt ...
--> reading update-model/local/dict/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/silence_phones.txt is OK

Checking update-model/local/dict/optional_silence.txt ...
--> reading update-model/local/dict/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/optional_silence.txt is OK

Checking update-model/local/dict/nonsilence_phones.txt ...
--> reading update-model/local/dict/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking update-model/local/dict/lexicon.txt
--> reading update-model/local/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/lexicon.txt is OK

Checking update-model/local/dict/lexiconp.txt
--> reading update-model/local/dict/lexiconp.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/lexiconp.txt is OK

Checking lexicon pair update-model/local/dict/lexicon.txt and update-model/local/dict/lexiconp.txt
--> lexicon pair update-model/local/dict/lexicon.txt and update-model/local/dict/lexiconp.txt match

Checking update-model/local/dict/extra_questions.txt ...
--> reading update-model/local/dict/extra_questions.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/extra_questions.txt is OK
--> SUCCESS [validating dictionary directory update-model/local/dict]

fstaddselfloops update-model/dict/phones/wdisambig_phones.int update- 
model/dict/phones/wdisambig_words.int
prepare_lang.sh: validating output directory
utils/validate_lang.pl update-model/dict
Checking existence of separator file
separator file update-model/dict/subword_separator.txt is empty or does not exist, deal in word case.
Checking update-model/dict/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/dict/phones.txt is OK

Checking words.txt: #0 ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/dict/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> found no unexplainable phones in phones.txt

Checking update-model/dict/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 20 entry/entries in update-model/dict/phones/context_indep.txt
--> update-model/dict/phones/context_indep.int corresponds to update-model/dict/phones/context_indep.txt
--> update-model/dict/phones/context_indep.csl corresponds to update-model/dict/phones/context_indep.txt
--> update-model/dict/phones/context_indep.{txt, int, csl} are OK

Checking update-model/dict/phones/nonsilence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 156 entry/entries in update-model/dict/phones/nonsilence.txt
--> update-model/dict/phones/nonsilence.int corresponds to update-model/dict/phones/nonsilence.txt
--> update-model/dict/phones/nonsilence.csl corresponds to update-model/dict/phones/nonsilence.txt
--> update-model/dict/phones/nonsilence.{txt, int, csl} are OK

Checking update-model/dict/phones/silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 20 entry/entries in update-model/dict/phones/silence.txt
--> update-model/dict/phones/silence.int corresponds to update-model/dict/phones/silence.txt
--> update-model/dict/phones/silence.csl corresponds to update-model/dict/phones/silence.txt
--> update-model/dict/phones/silence.{txt, int, csl} are OK

Checking update-model/dict/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.int corresponds to update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.csl corresponds to update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.{txt, int, csl} are OK

Checking update-model/dict/phones/disambig.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in update-model/dict/phones/disambig.txt
--> update-model/dict/phones/disambig.int corresponds to update-model/dict/phones/disambig.txt
--> update-model/dict/phones/disambig.csl corresponds to update-model/dict/phones/disambig.txt
--> update-model/dict/phones/disambig.{txt, int, csl} are OK

Checking update-model/dict/phones/roots.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 43 entry/entries in update-model/dict/phones/roots.txt
--> update-model/dict/phones/roots.int corresponds to update-model/dict/phones/roots.txt
--> update-model/dict/phones/roots.{txt, int} are OK

Checking update-model/dict/phones/sets.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 43 entry/entries in update-model/dict/phones/sets.txt
--> update-model/dict/phones/sets.int corresponds to update-model/dict/phones/sets.txt
--> update-model/dict/phones/sets.{txt, int} are OK

Checking update-model/dict/phones/extra_questions.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in update-model/dict/phones/extra_questions.txt
--> update-model/dict/phones/extra_questions.int corresponds to update-model/dict/phones/extra_questions.txt
--> update-model/dict/phones/extra_questions.{txt, int} are OK

Checking update-model/dict/phones/word_boundary.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 176 entry/entries in update-model/dict/phones/word_boundary.txt
--> update-model/dict/phones/word_boundary.int corresponds to update-model/dict/phones/word_boundary.txt
--> update-model/dict/phones/word_boundary.{txt, int} are OK

Checking optional_silence.txt ...
--> reading update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> update-model/dict/phones/disambig.txt has "#0" and "#1"
--> update-model/dict/phones/disambig.txt is OK

Checking topo ...

Checking word_boundary.txt: silence.txt, nonsilence.txt, disambig.txt ...
--> update-model/dict/phones/word_boundary.txt doesn't include disambiguation symbols
--> update-model/dict/phones/word_boundary.txt is the union of nonsilence.txt and silence.txt
--> update-model/dict/phones/word_boundary.txt is OK

Checking word-level disambiguation symbols...
--> update-model/dict/phones/wdisambig.txt exists (newer prepare_lang.sh)
Checking word_boundary.int and disambig.int
sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> generating a 88 word/subword sequence
sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> ERROR: number of reconstructed words 0 does not match real number of words 88; indicates problem in L.fst or word_boundary.int.  phoneseq = , wordseq = finches pei reservations rambo mommy courtship dawdling divas vox reorient boomtown whore protectorate hurt rayner topeka adamant mugs fouls birth a._k. stand discontents amazed laurels buttering sidetrack boundary lamport occasional suspicion shortcut melons until threats droppings tourette's greece boo competence fire's throat reimburse buffington waged griffith's meshes twiddling forecasting peters catastrophe tiptoe psychoanalysis statewide polar diluting bandit acronyms alvarez snatching nolte dreary fonder snacked navigate foolish severe barbara influenza shelled manuel adulterous antisocial army palace dollars whiff chalice paws injuries pop legume hyped invalids chide goodridge crappie raving
--> generating a 48 word/subword sequence
sh: 2: export: (x86)/Intel/Intel(R): bad variable name

Checking update-model/dict/oov.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in update-model/dict/oov.txt
--> update-model/dict/oov.int corresponds to update-model/dict/oov.txt
--> update-model/dict/oov.{txt, int} are OK

sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> ERROR: update-model/dict/L.fst is not olabel sorted
sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> ERROR: update-model/dict/L_disambig.fst is not olabel sorted
--> ERROR (see error messages above)
prepare_lang.sh: error validating output

我也在 Kaldi help group 上问过这个问题。 Dan Povey 曾建议这可能是一个本地问题,子 shell 可能正在生成并引发此错误。

我的pwd输出如下:--

/home/nitin/kaldi/egs/aspire/s5

我的path.sh如下:

export KALDI_ROOT=`pwd`/../../..
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard 
file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 
1
. $KALDI_ROOT/tools/config/common_path.sh
export PATH=$KALDI_ROOT/tools/sctk/bin:$PATH
export LC_ALL=C
source ../../../tools/env.sh

我的 cmd.sh 如链接博客中所述 post 需要在 运行ning 后续命令之前获取:

# "queue.pl" uses qsub.  The options to it are
# options to qsub.  If you have GridEngine installed,
# change this to a queue you have access to.
# Otherwise, use "run.pl", which will run jobs locally
# (make sure your --num-jobs options are no more than
# the number of cpus on your machine.

#a) JHU cluster options
export train_cmd="queue.pl"
export decode_cmd="queue.pl --mem 4G"
export mkgraph_cmd="queue.pl --mem 8G"


#b) BUT cluster options
#export train_cmd="queue.pl -q all.q@@blade -l ram_free=1200M,mem_free=1200M"
#export decode_cmd="queue.pl -q all.q@@blade -l ram_free=1700M,mem_free=1700M"
#export decodebig_cmd="queue.pl -q all.q@@blade -l ram_free=4G,mem_free=4G"

#export cuda_cmd="queue.pl -q long.q@@pco203 -l gpu=1"
#export cuda_cmd="queue.pl -q long.q@pcspeech-gpu"
#export mkgraph_cmd="queue.pl -q all.q@@servers -l ram_free=4G,mem_free=4G"

#c) run it locally...
#export train_cmd=run.pl
#export decode_cmd=run.pl
#export cuda_cmd=run.pl
#export mkgraph_cmd=run.pl

这里有 Linux 队长可以帮助我吗?

错误信息

sh: 2: export: (x86)/Intel/Intel(R): bad variable name

表示由于缺少引号而导致的分词问题。

文本 (x86)/Intel/Intel(R) 看起来像是包含 space 的目录路径的一部分,因为它在 Windows 上很常见。可能类似于

C:/Program Files (x86)/Intel/Intel(R) something

您可能可以在 PATH 变量的值中找到它。

根据 KALDI 帮助组中引用的线程,问题可能在您的 path.sh 文件中。

用你当前的工作目录/home/nitin/kaldi/egs/aspire/s5行就不会出现问题

export KALDI_ROOT=`pwd`/../../..

但为了避免可能出现的问题,应该

export KALDI_ROOT="$(pwd)"/../../..

export KALDI_ROOT="$(pwd)/../../.."

问题似乎出现在脚本的第 2 行(与错误消息匹配):

export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH

我猜你的 PATH 包含带有 space 的目录(包括错误消息中显示的部分)。在这种情况下,shell 将在每个 space 上拆分行,您将得到类似

的内容
export PATH=something maybe_something_else (x86)/Intel/Intel(R) maybe_again_something

这将(尝试)导出变量 PATHmaybe_something_else(x86)/Intel/Intel(R)maybe_again_something... 这不是您想要的。您希望所有这些都在 PATH.

的值中

您很幸运从 shell 收到有关无效变量名称 (x86)/Intel/Intel(R) 的错误消息。如果所有部分都是有效的变量名,你会得到一个错误的 PATH 和一些不需要的环境变量,但没有错误消息。

所以你也应该引用这一行,一般来说所有可能包含 spaces 的变量的扩展。

我建议把path.sh改成

export KALDI_ROOT="$(pwd)/../../.."
export PATH="$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH"
[ ! -f "$KALDI_ROOT/tools/config/common_path.sh" ] && echo >&2 "The standard 
file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 
1
. "$KALDI_ROOT/tools/config/common_path.sh"
export "PATH=$KALDI_ROOT/tools/sctk/bin:$PATH"
export LC_ALL=C
source ../../../tools/env.sh

我不知道 KALDI。此文件是生成的还是您手动创建的?

export KALDI_ROOT=`pwd`/../../..

可能会有问题,因为它取决于您 运行 脚本时的当前工作目录。我不知道是否有一种机制可以确保您 运行 它仅来自此脚本所在的目录。否则这将导致 KALDI_ROOT

的错误值

我不知道是否有理由这样做,但使用绝对路径而不是依赖于您的工作目录的路径可能有意义。

当前目录/home/nitin/kaldi/egs/aspire/s5将导致

export KALDI_ROOT=/home/nitin/kaldi/egs/aspire/s5/../../..

我会将脚本中的行替换为

export KALDI_ROOT=/home/nitin/kaldi

您可以在 KALDI 帮助组中询问此建议。

编辑:

如果向 path.sh 添加引号还不够,请同时检查 $KALDI_ROOT/tools/config/common_path.sh$KALDI_ROOT/tools/env.sh 以及其他可能存在的脚本。

作为起点,您可以搜索同时包含 export$PATH 行的文件。 (当然这也可能发生在其他变量上。)示例:

find /home/nitin/kaldi -type f -exec grep 'export.*$PATH' {} /dev/null \;

我刚刚注意到 path.sh 有点不一致。它同时使用 pwd 和变量 $PWD 的命令替换以及 $KALDI_HOME 和硬编码 ../../.. 就好像这些行是由不同的人写的一样。