Automatic Speech Annotation Using HMM based on Best Tree Encoding (BTE) Feature

Manual annotation, that is, time-aligning a speech waveform against the corresponding phonetic sequence, is a tedious and time-consuming task. This paper introduces a completely automated phone recognition system based on the Best Tree Encoding (BTE) 4-point speech feature. BTE is used to find phoneme boundaries along a speech utterance, and it is compared with the Mel-frequency cepstral coefficient (MFCC) speech feature on the same problem. Hidden Markov Models (HMMs) with Gaussian mixtures are used to build the statistical models throughout this research, and the HTK software toolkit is used to implement them. The system identifies spoken phones at a 59.1% recognition rate with MFCC and a 22.92% recognition rate with BTE. The current BTE vector has 4 components, compared with the 39 components of the MFCC vector widely used in the area of ASR. This makes BTE a very promising feature vector: with only 4 components it gives a recognition success rate that we consider comparable to that of the 39-component MFCC vector.


INTRODUCTION
Presently, manual annotation by expert phoneticians is the most precise way of time-aligning a speech waveform against the corresponding phonetic sequence. This is a tedious and time-consuming task, which makes it a prohibitive choice for large speech corpora. Several approaches have been proposed for the task of speech segmentation [2][3][4][5][6]. The most frequently used approach is based on HMM phone models. In this method each speech waveform is initially decomposed into a sequence of feature vectors, using a speech parameterization technique. Afterwards, a set of HMM phone models (a phone recognizer) is utilized to extract the corresponding phonetic sequence as well as the positions of the phonetic boundaries. Other speech segmentation methods have also been proposed in the literature. Some of them include detection of variations/similarities in spectral or prosodic parameters of speech, template matching using dynamic programming and/or synthetic speech, and discriminative learning segmentation.
Various speech parameterizations have been utilized in the phonetic segmentation task, with the Mel Frequency Cepstral Coefficients (MFCC) among the most widely used, especially in the HMM-based approach. Other speech features such as Perceptual Linear Prediction (PLP), Line Spectral Frequencies (LSF), Linear Predictive Coding (LPC), short-time energy, formants and wavelet-based features have also been used.
Automatic annotation provides a preliminary segmentation before manual annotation starts, which reduces the effort of the manual annotation task. In this paper, the most frequent approach is used: adapting a Hidden Markov Model (HMM) based phonetic recognizer to the task of automatic phonetic segmentation. Our baseline system uses a 10 ms frame shift with a 25 ms Hamming window. The speech is parameterized using MFCC and BTE. The MFCC vector consists of 12 Mel-frequency cepstral coefficients and normalized log energy, together with their first and second order differences, yielding a total of 39 components. The other parameterization technique is Best Tree Encoding (BTE), with 4 spectral-based components. A set of context-independent left-to-right (LR) monophone HMMs with one Gaussian per state is flat-initialized. Each HMM has 3 emitting states. These HMMs are trained using the HMM Tool Kit (HTK) with both the MFCC and BTE features for the problem of automatic annotation. A speech database is prepared to measure the quality of this experiment. The database is labeled, transcribed and then verified so that the results of automatic segmentation can be evaluated.
The following sections go through the details of this research. Section 2 defines the problem and reviews HMM-GMM based speech recognition. The BTE speech feature is explored in Section 3. The experimental framework is described in Section 4 and the experimental procedure in Section 5. The results are presented in Section 6 and the conclusion in Section 7. Finally, the references are listed in Section 8.
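The front-end figures quoted above (10 ms frame shift, 25 ms Hamming window, 39-component MFCC vector) can be checked with a small sketch. This is only an illustration under stated assumptions: the 16 kHz sampling rate and the function names are hypothetical and do not come from the paper, and HTK performs these steps internally.

```python
import math


def hamming(length):
    """Standard Hamming window (0.54 / 0.46 form) of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def frame_signal(samples, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Cut a signal into overlapping, Hamming-weighted frames."""
    win = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    window = hamming(win)
    frames = []
    # Only full windows are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - win + 1, shift):
        chunk = samples[start:start + win]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def mfcc_vector_size(n_cepstra=12, with_energy=True, n_difference_orders=2):
    """Static coefficients plus first and second order differences."""
    static = n_cepstra + (1 if with_energy else 0)
    return static * (1 + n_difference_orders)


if __name__ == "__main__":
    sample_rate = 16000  # assumed sampling rate, not stated in the paper
    signal = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate)
              for t in range(sample_rate)]  # one second of a test tone
    frames = frame_signal(signal, sample_rate)
    print(len(frames), "frames of", len(frames[0]), "samples")
    print("MFCC vector size:", mfcc_vector_size())
```

With these assumed settings each frame covers 400 samples and consecutive frames start 160 samples apart, while the feature size works out as (12 + 1) x 3 = 39.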

PROBLEM DEFINITION
The problem addressed in this research is automatic speech annotation at the Arabic phone level. The phone is taken as the basic speech unit, and annotation is defined as finding the phone boundaries along the stream of human speech. Speech features should be stable over the duration of a phone: the better the features, the more accurate the boundaries.
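As a minimal illustration of what "finding the phone boundaries" means in practice, the sketch below turns a per-frame phone labelling into time-stamped segments. The 10 ms frame shift matches the baseline described in the introduction; the function name and the phone labels in the usage note are hypothetical.

```python
def frames_to_segments(frame_labels, shift_ms=10.0):
    """Collapse runs of identical frame labels into (start_ms, end_ms, phone).

    Boundaries are placed on frame-shift multiples; the analysis window
    length is ignored, so this is a simplification.
    """
    segments = []
    if not frame_labels:
        return segments
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the end of the data or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((start * shift_ms, i * shift_ms, frame_labels[start]))
            start = i
    return segments
```

For example, `frames_to_segments(["sil", "sil", "b", "b", "b", "a"])` yields three segments whose shared end and start times are the phone boundaries.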

HMM-GMM BASED SPEECH RECOGNITION
In HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) based speech recognition (see Gales and Young, 2007 [10] for a review), the short-time spectral characteristics of speech are turned into a vector (the "observations" of Fig. 1, sometimes called frames), and a generative model is built using an HMM that produces sequences of these vectors. A left-to-right three-state HMM topology as in Fig. 1 will typically model the sequence of frames generated by a single phone. Models for sentences are constructed by concatenating HMMs for sequences of phones. Different HMMs are used for phones in different left and right phonetic contexts, using a tree-based clustering approach to model unseen contexts (see Young et al., 1994 [11] for a review). The index $j$ will be used for the individual context-dependent phonetic states, with $1 \le j \le J$. While $J$ could potentially equal three times the cube of the number of phones (assuming only the immediate left and right phonetic context will be modeled), after tree-based clustering it will typically be several thousand. The distribution that generates a vector within HMM state $j$ is a Gaussian Mixture Model (GMM):

$$p(\mathbf{x} \mid j) = \sum_{m=1}^{M_j} w_{jm}\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}) \qquad (1)$$

Table 1 shows the parameters of the probability density functions (pdfs) in an example system of this kind: each context-dependent state (of which we only show three rather than several thousand) has a different number of sub-states $M_j$.

Figure 1: HMM for speech recognition

HTK is principally concerned with continuous density models in which each observation probability distribution is represented by a mixture Gaussian density. In this case, for state $j$ the probability $b_j(\mathbf{o}_t)$ of generating observation $\mathbf{o}_t$ is given by

$$b_j(\mathbf{o}_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_{js}} c_{jsm}\, \mathcal{N}(\mathbf{o}_{st};\, \boldsymbol{\mu}_{jsm}, \boldsymbol{\Sigma}_{jsm}) \right]^{\gamma_s}$$

where $M_{js}$ is the number of mixture components in state $j$ for stream $s$, $c_{jsm}$ is the weight of the $m$'th component and $\mathcal{N}(\cdot\,;\, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is a multivariate Gaussian with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, that is

$$\mathcal{N}(\mathbf{o};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^n\, |\boldsymbol{\Sigma}|}} \exp\!\left( -\tfrac{1}{2} (\mathbf{o} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{o} - \boldsymbol{\mu}) \right)$$

where $n$ is the dimensionality of $\mathbf{o}$. The exponent $\gamma_s$ is a stream weight and its default value is one.
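The Gaussian mixture output density described above can be evaluated numerically as in the following sketch. It makes two simplifying assumptions that are not taken from the paper: covariance matrices are diagonal (stored as variance vectors), and there is a single stream with stream weight one. The function names are illustrative only.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)  # log |Sigma| for a diagonal Sigma
    maha = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)


def log_gmm(o, weights, means, variances):
    """Log of the weighted sum of component densities for one state."""
    terms = [math.log(w) + log_gaussian_diag(o, m, v)
             for w, m, v in zip(weights, means, variances) if w > 0.0]
    peak = max(terms)
    # Log-sum-exp keeps the sum stable when the densities are very small.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain is a common design choice here, because products of many small frame likelihoods would otherwise underflow.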
Stream weights other than one can be used to emphasize particular streams; however, none of the standard HTK tools manipulate them. HTK also supports discrete probability distributions, in which case

$$b_j(\mathbf{o}_t) = \prod_{s=1}^{S} \left\{ P_{js}\!\left[ v_s(\mathbf{o}_{st}) \right] \right\}^{\gamma_s}$$

where $v_s(\mathbf{o}_{st})$ is the output of the vector quantiser for stream $s$ given input vector $\mathbf{o}_{st}$ and $P_{js}[v]$ is the probability of state $j$ generating symbol $v$ in stream $s$. In addition to the above, any model or state can have an associated vector of duration parameters $\{d_k\}$. Also, it is necessary to specify the kind of the observation vectors and the width of the observation vector in each stream. The complete set of items needed to define a single HMM is itemized in the list that follows the conclusions.

In automatic speech recognition (ASR) systems, Gaussian mixture HMMs are normally used as acoustic models for basic speech units, ranging from context-independent whole words in small-vocabulary ASR tasks to context-dependent phonemes (e.g., triphones) in large-vocabulary ASR. Traditionally, the HMM-based acoustic models are estimated from available training data using the well-known EM algorithm based on the maximum-likelihood (ML) criterion. To deal with data sparseness problems in model training, phonetic decision trees are normally used to tie HMM states from different triphone contexts. In order to derive a simple closed-form solution, the decision trees are normally grown on simple models, such as single-Gaussian HMMs. After the state-tied structure is determined from the decision trees, a separate "mixing-up" step is used to gradually increase the number of Gaussian mixtures in each tied HMM state until the optimal performance is achieved. In today's ASR systems, e.g., HTK, "mixing-up" is normally implemented in two steps [2]: 1) all existing Gaussians, or the most dominant Gaussian mixture component in an HMM state, are split based on some random or heuristic strategy; 2) all split Gaussians are re-estimated with the EM algorithm. This incremental method for increasing model complexity is a good strategy for learning very large-scale statistical models without getting trapped in a bad local optimum. However, some problems remain when model complexity is increased with the "mixing-up" strategy. First, the random splitting strategy is not optimal in terms of the model estimation criterion. For example, there is no guarantee that the newly added Gaussian components from random splitting always increase the likelihood function prior to re-estimation. Second, since the subsequent EM-based re-estimation is sensitive to the initial parameters of the randomly split Gaussians, there is no guarantee that it always converges to the optimal point.
In HTK, the conversion from single Gaussian HMMs to multiple mixture component HMMs is usually one of the final steps in building a system.The mechanism provided to do this is the HHED MU command which will increase the number of components in a mixture by a process called mixture splitting.This approach to building a multiple mixture component system is extremely flexible since it allows the number of mixture components to be repeatedly increased until the desired level of performance is achieved.
The MU command has the form MU n itemList, where n gives the new number of mixture components required and itemList defines the actual mixture distributions to modify. This command works by repeatedly splitting the mixture with the largest mixture weight until the required number of components is obtained. The actual split is performed by copying the mixture, dividing the weights of both copies by 2, and finally perturbing the means by plus or minus 0.2 standard deviations. For example, the command MU 3 {aa.state[2].mix} would increase the number of mixture components in the output distribution for state 2 of model aa to 3. Normally, however, the number of components in all mixture distributions will be increased at the same time, so a command whose itemList covers all models and states is more usual. It is usually a good idea to increment mixture components in stages, for example, by incrementing by 1 or 2 and then re-estimating, then incrementing by 1 or 2 again and re-estimating, and so on until the required number of components is obtained. This also allows recognition performance to be monitored to find the optimum.
A phone prototype HMM can start with 4 mixtures per state; however, this was only a (fairly good) guess on our part. To be sure that the optimal topology has been chosen for the models, there is no way to avoid heuristic trial and error, so we ran a series of trainings with different numbers of mixtures. It is recommended to start with a single-Gaussian model, train it until it converges on the development set, then increase the number of mixtures by one, retrain, and so on.
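The splitting rule described for the MU command (copy the heaviest component, halve both weights, perturb the means by plus or minus 0.2 standard deviations) can be sketched as follows. This follows the description in the text only; it is not the HHEd source code, it assumes diagonal covariances stored as variance vectors, and the function names are hypothetical.

```python
import math


def split_heaviest(weights, means, variances, perturb=0.2):
    """Split the component with the largest weight into two components."""
    k = max(range(len(weights)), key=lambda i: weights[i])
    half = weights[k] / 2.0
    std = [math.sqrt(v) for v in variances[k]]
    up = [m + perturb * s for m, s in zip(means[k], std)]
    down = [m - perturb * s for m, s in zip(means[k], std)]
    # The first copy replaces component k; the second copy is appended.
    new_weights = weights[:k] + [half] + weights[k + 1:] + [half]
    new_means = means[:k] + [up] + means[k + 1:] + [down]
    new_vars = (variances[:k] + [list(variances[k])]
                + variances[k + 1:] + [list(variances[k])])
    return new_weights, new_means, new_vars


def mix_up(weights, means, variances, target):
    """Repeat the split until the mixture has `target` components."""
    while len(weights) < target:
        weights, means, variances = split_heaviest(weights, means, variances)
    return weights, means, variances
```

Starting from a single Gaussian, `mix_up([1.0], [[0.0]], [[4.0]], 3)` gives three components whose weights still sum to one. In a real training run each such increment would be followed by re-estimation, as the text recommends.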
One final point with regard to multiple mixture component distributions is that all HTK tools ignore mixture components whose weights fall below a threshold value called MINMIX (defined in HModel.h).Such mixture components are called defunct.Defunct mixture components can be prevented by setting the -w option in HEREST so that all mixture weights are floored to some level above MINMIX.If mixture weights are allowed to fall below MINMIX then the corresponding Gaussian parameters will not be written out when the model containing that component is saved.It is possible to recover from this, however, since the MU command will replace defunct mixtures before performing any requested mixture component increment.

BEST TREE ENCODING
BTE is a simple on/off entropy mapping of the signal into the bands into which the signal is decomposed using wavelet packets. The key property of BTE is the alignment of neighboring frequency-domain bands in the wavelet packet decomposition of the signal: adjacent bands are much closer in distance than non-adjacent bands. The tree structure shown in Figure 3 is encoded into a feature vector of 3 elements, as shown in Table 2.

EXPERIMENT FRAMEWORK
The framework we developed to train and test GMM-HMM models uses HTK to perform feature extraction and to build the baseline models, which are used to align the training data. Microsoft C# (C sharp) is used to write the programs and algorithms needed to build the initial HTK models. The HTK tools for training and decoding are a collection of command-line programs such as HERest and HVite. Each performs a specific function, which is explained in detail in the HTK book [9]. The phonetic context tree of the HTK baseline models is utilized in the proposed system. Training and testing in the proposed system is based on Weighted Finite State. HTK tools evaluate the Viterbi path based on likelihood. The number of Gaussian mixtures (GM) is a factor in the success rate of the BTE experiment, so this number is varied as an experiment parameter. Figure 4 shows the effect of changing this value on the success rate.
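To make the notion of a likelihood-based Viterbi path concrete, the sketch below implements a generic log-domain Viterbi search over HMM states. It is a textbook illustration under assumed inputs (initial, transition and per-frame emission log-probabilities supplied as plain lists); it does not reproduce HVite, and all names are hypothetical.

```python
def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[t][j]  : log likelihood of frame t under state j
    """
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    state = max(range(n_states), key=lambda j: score[j])
    best = score[state]
    path = [state]
    # Follow the back-pointers from the last frame to the first.
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best
```

The frames at which the returned path changes state are the candidate boundaries; with a left-to-right topology, forbidden transitions are simply given a log probability of negative infinity.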

CONCLUSIONS
The results tabulated in Table 1 indicate that BTE with 4 components is very promising. BTE is a newly developed feature that relies on spectral information. It consists of 4 components that encode the whole spectral information of the signal. We consider its results close to those of the well-known MFCC feature with 39 components, which makes BTE a very promising enhancement that is far more compact than MFCC.
The information needed to define a single HMM is:
• Type of observation vector
• Number and width of each data stream
• Optional model duration parameter vector
• Number of states
• For each emitting state and each stream
  - Mixture component weights or discrete probabilities
  - If continuous density, then means and covariances
  - Optional stream weight vector
  - Optional duration parameter vector
• Transition matrix

Bands are aligned so that adjacent wavelet bands are closer in distance than non-adjacent bands.

Figure 2: Figure 2-a illustrates how bands are sorted by the Matlab wavelet packets function. Figure 2-b indicates how bands are encoded in BTE. Bands are rearranged for calculating the BTE of the frame. The tree is encoded into a single number that holds the tree structure (its leaves) and weight according to Figure 2-b.

Figure 3: BTE for a certain wavelet packets best tree structure

Figure 4: Recognition rate versus maximum number of mixtures

TABLE 2: BEST TREE 4 POINT ENCODING EVALUATION

a. All samples are processed to generate the MFCC 39-point feature. HTK is used in this step.
b. All samples are processed to generate the BTE 4-point feature. Matlab is used in this step.

C. Marshaling
All feature files are normalized so that they can be processed in HTK. This process is called marshaling: the data from different sources are rearranged into a form that HTK tools can read. BTE feature vector files are marshaled into HTK format. HTK allows user-defined feature types, which lets HTK tools process data from other sources, not only data produced by HTK itself.

D. Model Design
a. A five-node LR HMM (three emitting states plus non-emitting entry and exit nodes) is created to model a single phone.
b. A survey of the most frequently used Gaussian mixture count for MFCC is used to set the number of Gaussian mixtures of the MFCC model.
c. For BTE, the Gaussian mixture count is an experiment parameter. It is tuned for the best success rate.
d. Dictionary and grammar files are created for the HTK phone recognition problem.

E. Training the Models
a. Using HTK and the training samples for MFCC, the MFCC models are trained.
b. Using HTK and the training samples for BTE, the BTE models are trained.

F. Testing the Models
a. Using HTK and the testing samples for MFCC, the MFCC models are tested.
b. Using HTK and the testing samples for BTE, the BTE models are tested.

Table 3 illustrates the results obtained from both systems. According to these results, BTE-4 gives results that we consider comparable to those of the well-known MFCC features. BTE is still in the development phase and has only 4 components compared with the 39 components of MFCC, which makes it a very promising feature.

TABLE 3: BTE-4 VERSUS MFCC-39 RECOGNITION RESULTS
N: total number of labels in the reference transcriptions.
I: number of insertion errors in the results string.
D: number of deletion errors in the results string.
S: number of substitution errors in the results string.
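The quantities N, I, D and S are the ones HTK's HResults tool combines into its two summary figures, percent correct and accuracy. The sketch below shows that arithmetic; the paper does not state which of the two figures its recognition rates correspond to, and the function name and the counts in the usage note are made up for illustration.

```python
def recognition_scores(n, d, s, i):
    """Percent correct and accuracy from label and error counts.

    n: labels in the reference transcriptions
    d: deletion errors, s: substitution errors, i: insertion errors
    """
    if n <= 0:
        raise ValueError("n must be positive")
    hits = n - d - s                       # correctly recognized labels
    correct = 100.0 * hits / n             # ignores insertion errors
    accuracy = 100.0 * (hits - i) / n      # also penalizes insertions
    return correct, accuracy
```

For example, with hypothetical counts `recognition_scores(100, 5, 10, 3)` the percent correct is 85.0 and the accuracy is 82.0; accuracy can never exceed percent correct because insertions are only subtracted.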