Lattices in Kaldi

Translated by: izy

Date: July 2015

Introduction

A lattice is a representation of the most likely alternative word sequences for a particular utterance. To understand lattices properly, you should understand decoding graphs in the WFST framework (see Decoding graph construction in Kaldi). In this section we give an overview of the issues relating to Kaldi lattices, and explain them in more detail later on this page. The reader may also want to read the paper "Generating exact lattices in the WFST framework", D. Povey, M. Hannemann et al., ICASSP 2012 (submitted), which gives more context on how our lattices relate to previously published algorithms. This page focuses more on the programming aspects.

There are two representations of lattices in Kaldi. The first is the Lattice type. This is an FST whose weights contain two floating-point values (the graph cost and the acoustic cost), whose input symbols are transition-ids (which roughly correspond to context-dependent HMM states), and whose output symbols are words (in general, they represent the output symbols of the decoding graph).

The second representation is the CompactLattice type. This contains essentially the same information as a Lattice, but in a different form: it is an acceptor (meaning the input and output symbols are always identical), the input/output symbols represent words, and the sequences of transition-ids are included as part of the weights (note that OpenFst has a very general concept of weights; anything satisfying the semiring axioms can be a weight). The weights in a CompactLattice contain a pair of floats and a sequence of integers (representing transition-ids). We describe the semiring below.

Since Lattice and CompactLattice represent the same information, they are equivalent as far as I/O is concerned: code that reads Lattices can also read CompactLattices, and vice versa. For example, if you use SequentialLatticeReader to read a set of lattices indexed by utterance-id, it works the same regardless of whether the archive actually contains Lattices or CompactLattices. However, we normally write lattices out as CompactLattices. For programs that need to convert between the two, there is the function ConvertLattice().

The weights in the Lattice and CompactLattice types contain two floats, the graph cost and the acoustic cost. The graph cost is the sum of the LM cost, the (weighted) transition probabilities, and any pronunciation cost. Because these are mixed together, when doing LM rescoring we first have to subtract the old LM cost and then add the new LM cost. If we call the two floats (a,b), the semiring in a Lattice is like the lexicographic semiring on (a+b, a-b) (see the OpenFst documentation for an explanation). That is, when we "add" (in the semiring) two such pairs, we take the pair with the lowest (graph + acoustic) cost; if they are equal, we break ties using (graph - acoustic). This makes sense in a "take the best path" framework. In practice it is only sensible if the acoustic weights are appropriately scaled. Our convention is to store acoustic costs unscaled externally (e.g. on disk), to apply the scale whenever algorithms compare weights, and to un-scale them again when writing the lattices out.

The Lattice and CompactLattice types are used as data structures to represent conventional lattices, and also to represent N-best lists. The lattice generation algorithm has quite precise properties. Let the lattice-delta be some number (typically around 10). The algorithm guarantees (modulo beam pruning) that every word sequence whose best alignment has cost within lattice-delta of the best path is present in the lattice, with its most likely alignment and weight. Also, each word sequence appears only once in the lattice. This is achieved by first creating a state-level lattice and pruning it appropriately, and then determinizing it in a special semiring (or, if you prefer, using a special determinization algorithm for non-functional FSTs that always picks the most likely path). We explain below how this is done.

We always ensure that lattices written to disk have only one path for each word sequence. This is necessary for some algorithms (such as LM rescoring). Being determinized (on the word labels, in the CompactLattice format) is a sufficient condition for this.

We never attach symbol tables to FSTs in Kaldi code. To view archives of lattices in their most natural form, with actual words in them, we use scripts to do the conversion (see egs/rm/s1/steps/decode_tri1_latgen.sh for an example).

The Lattice type

The Lattice type is just an FST templated on a particular semiring. In kaldi-lattice.h, we create a typedef:

typedef fst::VectorFst<LatticeArc> Lattice;

Here, LatticeArc is an arc type templated on LatticeWeight:

typedef fst::ArcTpl<LatticeWeight> LatticeArc;

LatticeWeight is, in turn, an instantiation of the LatticeWeightTpl template with BaseFloat as the floating-point type:

typedef fst::LatticeWeightTpl<BaseFloat> LatticeWeight;

The LatticeWeightTpl template is defined in namespace fst, in the file fstext/lattice-weight.h. It has some similarities with the lexicographic semiring, i.e. with the OpenFst type:

LexicographicWeight<TropicalWeight, TropicalWeight>

In both cases, addition (the semiring "plus" operation) is defined as taking the maximum, but with different definitions of the ordering. LexicographicWeight compares the first element first, and uses the second to break ties. LatticeWeight compares the sum first, and breaks ties with the difference. So a LatticeWeight (a,b) is equivalent to a LexicographicWeight (a+b, a-b). The intention of LatticeWeight is to take the path with the lowest total cost (graph plus acoustic cost), while separately remembering the acoustic and graph costs. In an ideal world, we might want to keep more information around: for example, the LM cost, the transition-model cost and the pronunciation-probability cost separately. But this is not practical, because these costs are mixed together in the decoding graph, and separating them out there would lead to a significant expansion in the size of the decoding graph.

As mentioned above, the input symbols of a Lattice represent transition-ids and the output symbols represent words (or whatever symbols the decoding graph outputs).

When designing Kaldi, we considered using a LexicographicWeight type instead of LatticeWeight, with the first element being the (graph + acoustic) cost and the second being the acoustic cost. We decided against this in the end: although it might sometimes be more efficient, we felt it would be a bit too confusing.

Many algorithms on lattices (e.g. taking the best path, or pruning) are more efficient on the Lattice type than on the CompactLattice type. The reason is that the weights in a CompactLattice contain sequences of transition-ids, and operations like taking the best path multiply weights, which corresponds to appending these sequences; for many algorithms this would take time quadratic in the length of the lattice. As mentioned above, you can read a Lattice from an archive even if it actually contains CompactLattices, because the Holder type (LatticeHolder) does automatic conversion. So the code:

  SequentialLatticeReader lattice_reader(lats_rspecifier);
  for (; !lattice_reader.Done(); lattice_reader.Next()) {
     std::string key = lattice_reader.Key();
     Lattice lat = lattice_reader.Value();
     ...
  }

is valid even if the archive or script file "lats_rspecifier" contains the CompactLattice format — and it normally would, because that is the format in which we write lattices out. An example of lattice type conversion is:

  Lattice lat; 
  // initialize lat.
  CompactLattice compact_lat;
  ConvertLattice(lat, &compact_lat);

Conversion to the CompactLattice type involves a "factoring" operation — the function fst::Factor() (defined in fstext/factor.h) — which identifies chains of states that can be combined into a single CompactLattice arc. A typical example of applying an OpenFst algorithm to a lattice is below (this code, modified from latbin/lattice-best-path.cc, finds the best path through a lattice):

  Lattice lat; 
  // initialize lat.
  Lattice best_path;
  fst::ShortestPath(lat, &best_path);

Compact lattices (the CompactLattice type)

Like the Lattice type, CompactLattice is a typedef of a standard OpenFst template:

typedef fst::VectorFst<CompactLatticeArc> CompactLattice;

The CompactLatticeArc type is defined as follows:

typedef fst::ArcTpl<CompactLatticeWeight> CompactLatticeArc;

The CompactLatticeWeight type is defined as follows:

typedef fst::CompactLatticeWeightTpl<LatticeWeight, int32> CompactLatticeWeight;

The template arguments are the base weight (LatticeWeight) and an integer type (int32) used to store the transition-ids. Note that in Kaldi code (such as the above) we tend to define concrete types via typedefs, whereas OpenFst defines templates; when writing code inside namespace fst, we follow the OpenFst convention. So we define a template in namespace fst, and then make it into a specific type (CompactLatticeWeight) in namespace kaldi.

CompactLatticeWeightTpl is templated on a base weight (here LatticeWeight) and an integer type (here int32). It contains two data members: a weight and a sequence of integers:

template<class WeightType, class IntType>
class CompactLatticeWeightTpl {
   ...
 private:
  WeightType weight_;
  vector<IntType> string_; 
};

These can be accessed by the member functions Weight() and String(). The semiring used by CompactLatticeWeightTpl does not, as far as we know, correspond to any semiring in OpenFst. Multiplication corresponds to multiplying the weights and appending the strings. When we add two CompactLatticeWeights, we first compare the weights and take whichever weight is "better" (i.e. has the lower cost), together with its corresponding string. If the weights are identical, we break ties using an ordering on the strings: the rule is, if the lengths differ take the shorter string, and otherwise use lexical order. (We cannot use lexical order alone, or it would conflict with the distributive property required by the semiring axioms.)

Comparing weights, or testing them for equality, presupposes that there is a total order on the semiring; this is a requirement on the template weight (which LatticeWeight and TropicalWeight satisfy). The way OpenFst accesses this ordering is via the "NaturalLess" template, which, given a and b, tests whether (a+b) equals a or b. For efficiency, we instead define a Compare() function on the weight type, returning -1, 0 or 1. We define such a function for the LatticeWeightTpl template, and, for testing purposes, an appropriate Compare function for the instantiation of CompactLatticeWeightTpl on TropicalWeight.

There is a reason why we define this rather strange-looking semiring, and it relates to our concept of a lattice. A lattice should contain the best path for any given word sequence. Suppose we have some lattice (imagine, say, a raw state-level lattice) represented as a CompactLattice, i.e. as an acceptor on word sequences whose weights contain the transition-id sequences. The interpretation of a WFST is that for any (input-symbol-sequence, output-symbol-sequence) pair, the weight the WFST assigns is the sum of the weights of all paths having that pair of symbol sequences. In the current case we have an acceptor, so there is just one symbol sequence (the word sequence). When we sum the (weight, string) pairs in this semiring, we get the best (i.e. lowest-cost) one. So if we were to remove epsilons and determinize in this semiring, we would get a lattice that has only one path for each word sequence, and that path would be the best one. This would not be very efficient, so in practice we have a special determinization algorithm that removes epsilons at the same time; its effect is the same as calling OpenFst's RemoveEps() and Determinize().

Lattice generation

Currently, the only decoder that generates lattices is the class LatticeSimpleDecoder, defined in decoder/lattice-simple-decoder.h and invoked by gmm-latgen-simple.cc. As the name suggests, LatticeSimpleDecoder is modified from SimpleDecoder. SimpleDecoder is a straightforward implementation of the Viterbi beam search algorithm, with a single tunable parameter: the pruning beam (see SimpleDecoder: the simplest possible decoder). LatticeSimpleDecoder has one more important tunable parameter: the lattice beam (or lattice-delta), which should generally be smaller than the pruning beam. The basic picture is that LatticeSimpleDecoder first generates a state-level lattice, prunes it with the lattice-delta, and at the end runs a special determinization algorithm that keeps only the best path for each word sequence.

In SimpleDecoder, there are reference-counted tracebacks. In LatticeSimpleDecoder, a single traceback is not enough because the lattice has a more complicated structure; in fact, it turned out to be more convenient to store forward links than backward links. To free the links that are not needed after lattice-delta pruning, something more than reference counting is required; in fact, we do not use reference counting at all.

The final objective of the pruning algorithm is as follows (call this the "naive" approach). Firstly, suppose that for each frame, for each state that was active on that frame we create a structure or token of some kind, and for each arc out of it (to a state that was not pruned), we create some kind of link record. Let's use the word "token" for the structure that we create for a particular graph state at a particular frame, and "link" for the forward link from one token to another token (note: these links can be within a frame as well as across frames, due to epsilon arcs in the decoding graph). The simplest possible version of the pruning algorithm is: firstly, we decode up till the end of the file, keeping all the tokens and links that were active during decoding. When we have come to the end of the file we want to apply the lattice-delta beam, such that any state or arc that is not on a path within lattice-delta cost of the best path will be pruned away. This is fairly easy to do: for example, we could turn the state-level traceback into an FST and use OpenFst's Prune() algorithm. Unfortunately this wouldn't be a very practical algorithm because we would soon run out of memory: on each frame there are typically many thousands of states active, and we can't afford to wait till the end of the utterance to prune them.

Fortunately there is an efficient way to prune away these tokens without waiting until the end of the utterance. Periodically (every 25 frames by default), we do a backward sweep from the current frame to the beginning. In this backward sweep, we work out a "delta" quantity for each token. The "delta" is the difference between the cost of the best path through the token and the cost of the overall best path, and it can be computed in the backward sweep (the "forward" information that we need is already inside the token, and is static). We can define an analogous "delta" quantity for the forward links. Any time this "delta" exceeds the "lattice-delta" beam, we prune away the associated token or link. For the tokens at the currently active frame, we set the "delta" to zero; we can prove that this is the right thing to do. That is, with this algorithm we will only be pruning away things that we eventually would have pruned away anyway using the "naive" algorithm.

It might seem on the face of it that our pruning algorithm would take time quadratic in the length of the utterance (because every 25 frames we visit the whole utterance up to the current point), but in fact it is more like linear, because we can detect when all the "delta" values of tokens for a particular frame have not changed, and we can stop going backward in time. We expect that for very long utterances, we will normally prune backward for only a second or two. Even if it were not for this feature, the algorithm would be dominated by a linear-time component in practical cases, because the vast majority of tokens are deleted the first time they are visited by the algorithm.

We note that there are a couple of complexities in our implementation of this. One is that because we have forward links, while the pruning algorithm goes backward in time, there is a danger of "dangling" forward links being created when we delete() tokens, so we have to avoid delete()ing tokens until we have pruned the forward links from the previous frame. The other issue is that we have to treat the final frame specially, because we need to take into account the final probabilities.

Getting the raw lattice, and converting it into the final form

When we have reached the end of the utterance, we produce the lattice in two phases. The first phase is to create the "raw lattice". This is a state-level FST that is pruned using the lattice-delta but is not determinized, so it has many alternative paths corresponding to each word sequence. Creation of the "raw lattice" is very trivial, and consists of taking our token and link structures and translating them into OpenFst structures. If we are asked for the best path (e.g. for getting the transcript), we would just run OpenFst's ShortestPath algorithm on this raw lattice. If we are asked for the lattice, there are two options.

The first option is: the decoder supports outputting the raw state-level lattice directly (set --determinize-lattice=false); this is mostly useful if you are going to do acoustic rescoring with a system built with the same tree. Afterwards you would probably want to determinize the lattice using the lattice-determinize program, before storing the lattices on disk for an extended time (as the raw state-level lattices are quite large).

Let's assume instead that we're outputting the determinized lattice (the second option). The determinized lattice is supposed to have only one path through it for every word sequence that is allowed in the original raw lattice. We could achieve this by creating a CompactLattice, removing epsilons and determinizing, but this would be very inefficient. Instead, we run a special determinization algorithm called DeterminizeLattice(). This algorithm does epsilon removal and determinization together. In this respect it is similar to our algorithm DeterminizeStar(), but the DeterminizeLattice() algorithm uses data structures that are specifically optimized for this case. The issue is the cost overhead of appending long strings of input symbols together (the sequences of symbols are only about as long as the maximum number of frames in a word, but this is quite long). That is, if represented in the most naive way as vectors, adding a symbol to a list of N symbols and storing the result separately takes time O(N). Instead we use data structures that make it take constant time. DeterminizeLattice() takes as input the Lattice format, since this is the most natural form for the input, and outputs in the CompactLattice format. However, it should really be viewed as determinization of the CompactLattice format: that is, it preserves equivalence only when viewed as an operation on a CompactLattice.

Lattices in archives

We normally store lattices in archives. For general information about archives and the Kaldi I/O mechanisms, see Kaldi I/O mechanisms. A concrete example of generating lattices from the command line is:

gmm-latgen-simple --beam=13.0 --acoustic-scale=0.0625 exp/tri/1.mdl \
     exp/graph_tri/HCLG.fst ark:test_feats.ark "ark,t:|gzip -c > exp/lats_tri/1.lats.gz"

After this, we could view the lattices as follows:

gunzip -c exp/decode_tri2a_bg_latgen_eval92/7.lats.gz | \
  scripts/int2sym.pl --field 3 data/words.txt | head
444c0410
0 1 <s> 5.27832,0,
1 2 <UNK> 8.08116,1099.84,
1 3 A 12.8737,8342.93,3_1_1_1_1_1_1_1_1_1_1_12_18_17_17_17_17_17_17_17_17_17_3\
464_3596_3632_3_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_\
1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_12_18_17_17_18562_18561_18604_18\
603_18603_18618_604_603_603_638
1 4 IF 10.2262,8096.64,3_1_1_1_1_1_1_1_1_1_1_12_18_17_17_17_17_17_17_13708_137\
28_13806_12654_12670_12676_3_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1\
_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_12_18_17_17_17_1856\
2 5 NOT 12.4568,10548.7,3_1_1_1_1_1_1_1_1_1_1_12_18_17_17_17_17_17_17_17_20_26\
...

Note that the <UNK> label has graph and acoustic scores but no input labels (these would appear after the last comma, if present), which may seem odd here. The thing to understand is that the graph/acoustic scores and the input-label sequences are only valid when added up (or appended) along entire paths through the FST; there is no guarantee that they are synchronized with each other or with the word labels.

Lattices are normally stored in archives in the CompactLattice format, and the convention is that the acoustic weights are unscaled. For operations that are sensitive to the acoustic weight (such as pruning), the corresponding command-line programs therefore take an --acoustic-scale option; they scale the acoustic weights before the operation and un-scale them again afterwards.

Operations on lattices

Below we discuss some of the operations on lattices and the corresponding programs.

Pruning lattices

Lattices can be pruned with a specified pruning beam. This removes arcs and states that do not appear on paths whose cost is sufficiently close to the cost of the best path through the lattice. For each lattice, the program lattice-prune first scales the acoustic weights with the specified scale, then calls OpenFst's Prune() function, and finally reverses the scaling of the acoustic weights. An example command line is:

lattice-prune --acoustic-scale=0.1 --beam=5 ark:in.lats ark:out.lats

Note: for efficiency, the pruning operation is done in the Lattice format, but the result is converted to CompactLattice when written out.

Computing the best path through a lattice

The program lattice-best-path computes the best path through a lattice and outputs the corresponding input-symbol sequence (the alignment) and output-symbol sequence (the transcription). As usual, the inputs and outputs are archives. An example command line is:

lattice-best-path --acoustic-scale=0.1 ark:in.lats ark:out.tra ark:out.ali

Computing the N-best hypotheses

The program lattice-nbest computes the N best paths through a lattice (using OpenFst's ShortestPath() function) and outputs them as a lattice (a CompactLattice) with a particular structure. As documented for OpenFst's ShortestPath(), the start state will have (up to) n arcs out of it, each one the start of a separate path; note, however, that these paths may share suffixes. An example command line is:

lattice-nbest --n=10 --acoustic-scale=0.1 ark:in.lats ark:out.nbest

Language model rescoring

Because the "graph part" (the first component) of a LatticeWeight contains the language model score mixed together with the transition-model score and any pronunciation or silence probabilities, we cannot simply replace it with the new LM score, or we would lose the latter two. Instead, we first subtract the old LM probabilities and then add the new LM probabilities. The central operation in each of these two phases is composition (together with some scaling of weights, and so on). The corresponding command lines are: first, to remove the old LM probabilities,

lattice-lmrescore --lm-scale=-1.0 ark:in.lats G_old.fst ark:nolm.lats

and then, to add the new LM probabilities:

lattice-lmrescore --lm-scale=1.0 ark:nolm.lats G_new.fst ark:out.lats

There are other ways to do this; see the documentation for lattice-compose below. To explain what the program lattice-lmrescore does, it is easiest to first describe a simplified version. Suppose we are given an LM scale S and an LM G.fst. We would first multiply all the costs in G.fst by S. Then, for each lattice, we would compose it on the right with G, lattice-determinize the result (keeping only the best path for each word sequence), and write it out. This is fine for positive S, but for negative S it would be equivalent to taking the worst path through G for each word sequence. To fix this, we instead do the following. For each input lattice, we first scale the lattice's graph (or LM) costs by the inverse of S; we then compose on the right with G.fst; we run lattice-determinization, which retains only the best path through the lattice for each word sequence; and we then scale the lattice's graph/LM scores by S. This does the right thing for negative S. This approach only works if each word sequence has just one path through the input lattice (e.g. if it has been lattice-determinized). We assume that all lattices passed around by programs have this property (which is why it is not a good idea to write out "raw" state-level lattices, unless you know what you are doing).

Note that for the composition to work, the program needs to map G.fst from the tropical semiring into the LatticeWeight semiring. It does this by putting the weights of G.fst into the first part of the weight (the "graph" part) and setting the second part to zero (which is "one" in the semiring). At the C++ level, this mapping is done with OpenFst's MapFst mechanism: we define a Mapper class that maps StdArc to LatticeArc, and construct an object of type MapFst as needed to convert G.fst to the LatticeWeight weight type.

Probability scaling

The probabilities in lattice weights can be scaled with a 2x2 matrix. The corresponding command-line program is lattice-scale. An example:

lattice-scale --acoustic-scale=0.1 --lm-scale=0.8 ark:in.lats ark:out.lats

This is actually not used very often, because we expect the acoustic weights in archives to be unscaled; programs that need to apply a scale accept an --acoustic-scale option instead. Such programs apply the scale, perform an operation (such as determinization or pruning), and then apply the inverse of the scale. For most operations we do not scale the LM probabilities; however, it is sometimes useful, e.g. in discriminative training it may be helpful to tune both the LM and acoustic scales. The lattice-scale program also accepts the options --lm2acoustic-scale and --acoustic2lm-scale. For instance, the program adds --lm2acoustic-scale times the LM part of the original weight to the acoustic part of the new weight. We include this mainly for completeness.

Lattice union

The program lattice-union computes the union of two lattices (like other programs, it iterates over sets of lattices in archives). The main envisaged use of this is in discriminative training (especially MMI), to ensure that the correct transcription is present in the denominator lattice. An example command line is:

lattice-union ark:num_lats.ark ark:den_lats.ark ark:augmented_den_lats.ark

The program calls OpenFst's Union() function, then lattice-determinizes to ensure there is only one path for each word sequence, and writes the result out. In practice there are not many other scenarios where lattice union makes sense (e.g. it does not make sense to take the union of lattices computed from different features or models).

Lattice composition

The program lattice-compose has several modes of operation. One composes lattices with each other. This is done in the transducer form, i.e. viewing a lattice as a transducer from transition-ids to words. It would typically be done after first projecting one set of lattices on the output (to get word-to-word transducers). Typical usage is:

  lattice-project ark:1.lats ark:-  | \
    lattice-compose ark:- ark:2.lats ark:3.lats

Now, the paths in 3.lats will have scores that are the sums of the scores in 1.lats and 2.lats. You can accomplish the same thing in one step with lattice-interp (see the next section).

Another mode composes a lattice with a fixed FST. For this purpose, the FST is converted dynamically into a lattice; the weights of the FST are interpreted as the "graph part" of the lattice weights. One example of using this is to convert lattices to phones, then compose with a lexicon to convert them back to words, for example:

 lattice-to-phone-lattice final.mdl ark:1.lats ark:- | \
   lattice-compose ark:- Ldet.fst ark:words.lats

This is a simplified example; in addition, you would probably want to remove the original "graph scores" from the lattice and add in grammar and transition scores. Also, Ldet.fst is a specially processed version of the lexicon: the lexicon with disambiguation symbols (but without the #0-to-#0 loop that passes through the language-model disambiguation symbol), determinized and with the disambiguation symbols then removed.

The mode above can also be useful for language model rescoring; however, there is a different version of it for that purpose, shown below (this is useful for adding LM scores to lattices, assuming we have already removed the old LM scores):

  lattice-compose --phi-label=13461 ark:1.lats G.fst ark:2.lats

Here, G.fst is a grammar that contains the special "backoff symbol" #0 on its input side, and 13461 is the numeric label of the backoff symbol. This form of composition treats the backoff symbol as a failure transition (using OpenFst's PhiMatcher), which is taken only if the requested label does not exist: thus, G.fst is treated as a backoff language model.

Lattice interpolation

The program lattice-interp interpolates two sets of lattices, with a scaling parameter. An example:

  lattice-interp --alpha=0.4 ark:1.lats ark:2.lats ark:3.lats

means to put a weight of 0.4 on the scores of 1.lats and 0.6 on the scores of 2.lats. It keeps the alignments from the first archive, if present. Note that if the two lattices have no paths in common, the composition will be empty and the program will not output any lattice for that utterance, so the final archive may have "gaps" in it. You have to build some mechanism into your decoding scripts to produce output for those utterances (e.g. see egs/rm/s3/steps/decode_combine.sh). The behavior of this program can be simulated with a combination of lattice-scale, lattice-project and lattice-compose.

Conversion of lattices to phones

The program lattice-to-phone-lattice removes the word labels on the output side of a lattice and replaces them with phone labels, which are derived from the transition-ids on the input side. Note that the phone labels are not exactly aligned with the phone boundaries (as a rule, FSTs have no notion of alignment between input and output symbols). Typical usage is:

  lattice-to-phone-lattice final.mdl ark:1.lats ark:phones.lats

Lattice projection

Projection is an FST operation that turns an FST into an acceptor, copying either the input or the output symbols so that both are the same. The default for lattice-project is to copy the word labels to the input side (this is useful when doing lattice interpolation). An example is:

  lattice-project ark:1.lats ark:- | \
     lattice-compose ark:- ark:2.lats ark:3.lats

Lattice equivalence testing

The program lattice-equivalent is mostly useful as a debugging tool. It tests lattices for equivalence and, if they are all equivalent, returns exit status 0. It uses a randomized equivalence-testing algorithm. An example:

  lattice-equivalent ark:1.lats ark:2.lats || echo "Not equivalent!"

Optional arguments include --num-paths (the number of paths used in the randomized testing algorithm) and --delta (the score difference permitted when testing for equivalence).

Removing alignments from lattices

The program lattice-rmali removes the alignment information (i.e. the transition-ids) from the input side of the lattices. This is mainly useful when you will not need that information (e.g. you only need the lattices for LM rescoring), and it saves disk space. An example:

  lattice-rmali ark:in.lats ark:word.lats

Error boosting in lattices

The program lattice-boost-ali is used in boosted MMI training. It reads in a lattice and an alignment (both as tables, e.g. archives), and outputs the lattices with the language model scores (i.e. the graph part of the scores) boosted by the parameter b times the number of frame-level phone errors on that path. Typical usage is:

  lattice-boost-ali --silence-phones=1:2:3 --b=0.1 final.mdl ark:1.lats \
   ark:1.ali ark:boosted.lats

The silence phones are treated specially: wherever they appear in the lattices, they are assigned zero error, i.e. they are not counted as errors (this behavior can be controlled with the --max-silence option). Note that this special treatment of silence has been adopted from MPE training where it appeared to help; we have not investigated its exact effect on boosted MMI.

Computing posteriors from lattices

The program lattice-to-post computes, from a lattice, per-frame posterior probabilities of the transition-ids in the lattice. This is done by a standard forward-backward type of algorithm. Since, by convention, we store lattices without any acoustic scaling, it will normally be necessary to supply an acoustic scale to be used in the forward-backward algorithm. It also accepts a language model scale, but the default (1.0) will often be most appropriate. An example of using this program is:

  lattice-to-post --acoustic-scale=0.1 ark:1.lats ark:- | \
    gmm-acc-stats 10.mdl "$feats" ark:- 1.acc

Determinization of lattices

The program lattice-determinize implements lattice-determinization, which essentially consists of keeping, for each word-sequence in the lattice, only the best-scoring sequence of transition-ids. In general this process is sensitive to the acoustic scale (because "best-scoring" depends on the scale), but if the lattice has previously been determinized and then has only been processed in a "word-level" way, i.e. each word-sequence still has only one transition-id sequence, then the acoustic scale doesn't matter. The only time the acoustic scale is likely to matter is if you generate state-level lattices, e.g. you do lattice generation with --determinize-lattice=false. This program has other options, that are mostly related to what to do if determinization "blows up", i.e. exhausts memory, but in general you can just leave these at their default values. These are mostly there because a previous version of the determinization tended to exhaust memory. You may want to set --prune=true if you want to do pruning as part of the same program, but be careful that this only makes sense if you have set the acoustic scale to an appropriate value; you might also want to set the --beam option in that case.

A typical usage is:

   lattice-determinize ark:1.lats ark:det.lats

or:

   lattice-determinize --acoustic-scale=0.08333 ark:state_level.lats ark:1.lats

Note that lattices produced by lattice generation programs will by default already be determinized, so it only makes sense to do determinization if you have just done an operation like composition that makes lattices nondeterministic, or if you have given --determinize-lattice=false to the lattice generation program to create state-level lattices.

Computing oracle WERs from lattices

The program lattice-oracle takes two tables: the first of lattices, the second of transcriptions (as C++ vector<>'s of integers), and outputs the word sequence in each lattice closest to the transcription; the corresponding oracle WERs are printed in the log. This is done by constructing an "edit-distance FST" and composing it with unweighted acceptors constructed from the lattice and from the reference.

An example of using it (this script fragment is from the s3 scripts):

 cat $data/text | \
  sed 's:<NOISE>::g' |  sed 's:<SPOKEN_NOISE>::g'  | \
  scripts/sym2int.pl --ignore-first-field $lang/words.txt | \
  lattice-oracle --word-symbol-table=$lang/words.txt  \
    "ark:gunzip -c $dir/lats.pruned.gz|" ark:- ark,t:$dir/oracle.tra \
    2>$dir/oracle.log

Adding transition probabilities to lattices

The program lattice-add-trans-probs adds the transition probabilities into the costs of a lattice. This is useful in forms of lattice rescoring where the original graph costs are discarded and reconstructed from scratch. A typical usage is:

  lattice-add-trans-probs --transition-scale=1.0 --self-loop-scale=0.1 \
      final.mdl ark:1.lats ark:2.lats

For more explanation of these scaling factors, see Scaling of transition and acoustic probabilities.

Converting lattices to FSTs

The program lattice-to-fst converts an archive of lattices into an archive of FSTs, in the same format as the code that compiles the training graphs expects. The resulting FSTs are weighted word acceptors (although with the current program we scale the weights by zero, so in effect there are no weights). They may be used as input to compile-train-graphs-fsts to produce decoding graphs that only include the paths present in the lattice. The main use of this is in lattice rescoring between models that do not share the same tree. An example is:

  lattice-to-fst --lm-scale=0.0 --acoustic-scale=0.0 ark:1.lats ark:1.words

Copying lattices

The program lattice-copy simply copies lattices. This is mostly useful when you want to view a binary FST in text form. For example:

  lattice-copy ark:1.lats ark,t:- | head -50

N-best lists and best paths

In some cases (e.g. Viterbi training; rescoring with neural-net language models) we do not want the lattice structure but the best path, or the N best paths. The N-best format is the same as the lattice format, except there are multiple FSTs for each utterance (up to n of them, if n is the specified option). Suppose the utterance-ids are uttA, uttB, and so on. A table (e.g. an archive) of lattices will contain a lattice for uttA, a lattice for uttB, and so on. If we run:

  lattice-to-nbest --acoustic-scale=0.1 --n=10 ark:1.lats ark:1.nbest

then the archive 1.nbest will contain lattices uttA-1 through uttA-10, uttB-1 through uttB-10, and so on. Of course some utterances may have fewer entries, if the lattice did not contain that many distinct word sequences. The program lattice-to-nbest needs the acoustic scale, because it affects which paths have the lowest scores.

For some purposes even an FST is not convenient enough and we want a linear structure, e.g. a sequence of words. For this, we can use:

  nbest-to-linear ark:1.nbest ark:1.ali ark:1.words ark:1.lmscore ark:1.acscore

This program reads an archive of lattices, which must each be a linear FST, and outputs four archives, corresponding to the input-symbol sequences, the output-symbol sequences, and the acoustic and LM scores. The format would be as follows (supposing you write in text mode):

# head 1.words
utt-A-1 10243 432  436 10244
utt-A-2 7890  420 10244
...

Kaldi works with the numeric ids throughout; you can convert them to words using a symbol table and scripts/int2sym.pl. The reverse transformation to that of nbest-to-linear can be done with the program linear-to-nbest.

Another useful program is lattice-to-1best, which is not quite the same as lattice-to-nbest --n=1, because it does not append "-1" to the utterance id. You can pipe its output into nbest-to-linear to get the word sequence or the like. The program lattice-best-path takes a lattice and outputs two archives, containing the word sequences and the transition-id sequences. This exists for historical reasons; the combination of lattice-to-1best and nbest-to-linear is cleaner and more consistent with the overall framework.

Times on lattices

If you want explicit time information from a lattice (rather than having to count transition-ids), there are the functions LatticeStateTimes (for Lattice) and CompactLatticeStateTimes (for CompactLattice), which give you the time where each state is located (a number from 0 to the number of frames in the file). Be careful that, in general, the words are not synchronized with the transition-ids, meaning that the transition-ids on an arc do not necessarily all belong to the word whose label is on that arc. This means that the times you get from the lattice will (as far as the word labels are concerned) be inexact. The same is true of the weights, which are not synchronized with either the words or the transition-ids on an arc. If you want exact times (e.g. for conversion to HTK lattices, or for sclite scoring), you should run the program lattice-align-words. This program only works if you built your system with word-position-dependent phones, and it requires certain command-line options to tell it which phones are in which position in the word. See egs/wsj/s3/run.sh for an example. If your system does not have word-position-dependent phones, you can use the alternative program lattice-align-words-lexicon.