Tutorials

The ISCSLP 2012 Organising Committee is pleased to announce the following tutorials at the conference:


Trajectory Modeling for Robust Speech Recognition

Prof. Khe Chai Sim
Assistant Professor, School of Computing, National University of Singapore


Abstract

The Hidden Markov Model (HMM) is widely used to represent acoustic units in automatic speech recognition (ASR) systems. One of its major limitations is the conditional independence assumption: observation vectors are assumed independent of one another given the state sequence, which leads to a poor trajectory model. Although many explicit trajectory modeling techniques have been proposed and studied in the past, appending dynamic (delta and delta-delta) parameters to the static features remains the most popular way to circumvent the problem. Recently, advanced trajectory modeling techniques have been used to improve acoustic models and noise robustness for ASR.
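
To make the role of dynamic parameters concrete, the sketch below computes first-order delta coefficients with the standard regression formula used in toolkits such as HTK; the function name and the window half-width are illustrative choices, not part of the tutorial material.

```python
import numpy as np

def deltas(feats, theta=2):
    """First-order dynamic (delta) parameters via the standard
    regression formula with window half-width `theta`.

    feats: (T, D) array of static features (e.g. MFCCs).
    Returns a (T, D) array of deltas.
    """
    T = feats.shape[0]
    # Repeat edge frames so every frame has a full regression window.
    padded = np.pad(feats, ((theta, theta), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    out = np.zeros_like(feats, dtype=float)
    for k in range(1, theta + 1):
        # k-th term of sum_k k * (c[t+k] - c[t-k]) over all frames t.
        out += k * (padded[theta + k : theta + k + T]
                    - padded[theta - k : theta - k + T])
    return out / denom
```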

Semi-parametric trajectory models have been proposed to model the trajectory implicitly through temporally varying model parameters, expressed as a linear regression over a set of basis structures with temporally varying regression weights. Initial work focused on modeling temporally varying means (fMPE) and precision matrices (pMPE). Temporally Varying Weight Regression (TVWR) has also been proposed to model the component weights of the Gaussian Mixture Model (GMM). As a result, TVWR can represent non-stationary distributions within the HMM states, yielding an implicit trajectory model.
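
A minimal sketch of the TVWR idea follows: the Gaussians of a state's GMM stay fixed, while the mixture weights at each frame are predicted by a regression on a time-varying basis vector. The parameterisation below (positivity clipping plus renormalisation) is a simplified stand-in for the published formulation, and all names are illustrative.

```python
import numpy as np

def gauss_pdf(o, mean, var):
    """Diagonal-covariance Gaussian density."""
    return np.exp(-0.5 * np.sum((o - mean) ** 2 / var)) / \
           np.sqrt(np.prod(2.0 * np.pi * var))

def tvwr_likelihood(o_t, h_t, means, variances, base_w, R):
    """Schematic TVWR state likelihood.

    Instead of constant GMM weights, frame t uses weights predicted
    by a regression R @ h_t on a time-varying basis vector h_t
    (e.g. posteriors from an auxiliary model), renormalised to sum
    to one. The Gaussians stay fixed, so only the mixture weights
    carry the time-varying (trajectory) information.
    """
    w_t = base_w * np.maximum(R @ h_t, 1e-8)  # keep weights positive
    w_t /= w_t.sum()                          # valid mixture weights
    return sum(w * gauss_pdf(o_t, m, v)
               for w, m, v in zip(w_t, means, variances))
```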

The trajectory HMM has also been proposed as a reformulation of the conventional HMM that takes into account the explicit relationship between the static and dynamic acoustic features. The trajectory HMM is a generative model capable of producing smooth acoustic feature sequences, which makes it a popular choice for speech synthesis. In speech recognition tasks, however, it can only be used for N-best rescoring.
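
The explicit relationship in question is the linear constraint o = Wc: the full observation sequence o (statics stacked with their deltas) is a deterministic function of the static trajectory c. Below is a sketch of such a window matrix for a single feature dimension, using the simple delta window 0.5(c[t+1] - c[t-1]) with clamped edges; real systems typically add second-order deltas as further rows.

```python
import numpy as np

def build_window_matrix(T):
    """Window matrix W relating static and dynamic features, o = W c.

    c is the (T,) static trajectory (one dimension for clarity);
    o = W c is the (2T,) sequence of interleaved [static, delta]
    values, with the delta computed as 0.5*(c[t+1] - c[t-1]) and
    edge indices clamped to the valid range.
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                     # static row: copies c[t]
        W[2 * t + 1, max(t - 1, 0)] -= 0.5    # delta row
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    return W
```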

Nevertheless, the trajectory HMM has recently been applied to model-based noise compensation, improving the robustness of ASR in noisy environments. Traditionally, Parallel Model Combination (PMC) is used to synthesize noisy speech models from the clean speech and noise models, but PMC cannot easily compensate the dynamic parameters. The trajectory HMM formulation was therefore incorporated into PMC to yield a unified compensation of both static and dynamic parameters, a novel technique called Trajectory-based PMC (TPMC). TPMC has been found to outperform both PMC and Vector Taylor Series (VTS) compensation.
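
The intuition behind PMC is that speech and noise add in the linear spectral domain while the models live in a log (or cepstral) domain. The sketch below shows only the simplest piece, the log-add approximation for static log-spectral means; full PMC also compensates variances and moves between the cepstral and log-spectral domains via the DCT, and TPMC extends the compensation to the dynamic parameters through the trajectory formulation.

```python
import numpy as np

def pmc_log_add(mu_clean, mu_noise, g=1.0):
    """Log-add PMC approximation for static log-spectral means.

    Combines the clean-speech and noise Gaussian means in the
    linear spectral domain, then returns to the log domain:
        mu_noisy ~= log(g * exp(mu_clean) + exp(mu_noise))
    g is an optional gain-matching term. This is only the simplest
    static-mean compensation, shown for illustration.
    """
    return np.log(g * np.exp(mu_clean) + np.exp(mu_noise))
```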

This tutorial gives an introduction to trajectory modeling for HMM-based speech recognition and consists of two main parts. The first part describes several semi-parametric trajectory models for speech recognition, including fMPE, pMPE and TVWR. The second part covers several model-based approaches to noise-robust speech recognition, including PMC, VTS and TPMC.

Speaker Biography

Dr. Khe Chai Sim is an Assistant Professor at the School of Computing (SoC), National University of Singapore (NUS). He received the B.A. and M.Eng. degrees in Electrical and Information Sciences from Cambridge University, England, in 2001. For his undergraduate final-year project, supervised by Prof. Steve Young, he worked on the Application Programming Interface (API) for the Hidden Markov Model Toolkit (HTK), known as the ATK. He was then awarded the Gates Cambridge Scholarship and completed his M.Phil. dissertation, "Covariance Matrix Modelling using Rank-One Matrices", in 2002 under the supervision of Prof. Mark Gales. He joined the Machine Intelligence Laboratory (MIL) (formerly the Speech, Vision and Robotics (SVR) group) at the Cambridge University Engineering Department in the same year as a research student, again supervised by Prof. Mark Gales, and received his Ph.D. degree in July 2006. He worked as a Research Engineer at the Institute for Infocomm Research, Singapore, between 2006 and 2010. His main research interests are statistical pattern classification and acoustic modelling for automatic speech recognition. He worked on the DARPA-funded Effective, Affordable and Reusable Speech-to-text (EARS) project from 2002 to 2005 and the Global Autonomous Language Exploitation (GALE) project from 2005 to 2006, and has participated in the NIST rich transcription, machine translation, language recognition and speaker recognition evaluations.



A Statistical Approach to Voice Conversion and Its Applications for Augmented Human Communication

Prof. Tomoki Toda
Associate Professor, Nara Institute of Science and Technology (NAIST), Japan


Abstract

Voice conversion is a technique for modifying speech acoustics, converting nonlinguistic information into any form we want while preserving the linguistic content. One of the most popular approaches to voice conversion is statistical: complex conversion functions are learned from a parallel data set consisting of utterance pairs of the source and target voices. Although the technique was originally studied in the context of speaker conversion, which converts the voice of one speaker (the source speaker) to sound like that of another (the target speaker), it has great potential for applications well beyond speaker conversion.
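
A minimal sketch of the classic GMM-based conversion function follows: a GMM is trained on stacked source/target frames from the parallel data, and each new source frame is mapped through a posterior-weighted sum of per-component linear regressions. The container type and all field names are illustrative assumptions.

```python
import numpy as np
from typing import NamedTuple

class JointGMM(NamedTuple):
    """Per-component statistics of a GMM trained on stacked [x; y] frames."""
    w: np.ndarray      # (M,) mixture weights
    mu_x: np.ndarray   # (M, Dx) source means
    mu_y: np.ndarray   # (M, Dy) target means
    S_xx: np.ndarray   # (M, Dx, Dx) source covariance blocks
    S_yx: np.ndarray   # (M, Dy, Dx) cross-covariance blocks

def convert_frame(x, gmm):
    """Map one source frame x to a converted target frame:
    y_hat = sum_m p(m|x) (mu_y[m] + S_yx[m] S_xx[m]^-1 (x - mu_x[m]))."""
    M = len(gmm.w)
    # Component posteriors p(m|x) under the source marginal.
    log_p = np.empty(M)
    for m in range(M):
        d = x - gmm.mu_x[m]
        _, logdet = np.linalg.slogdet(2.0 * np.pi * gmm.S_xx[m])
        log_p[m] = (np.log(gmm.w[m])
                    - 0.5 * (d @ np.linalg.solve(gmm.S_xx[m], d) + logdet))
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # Posterior-weighted per-component linear regressions.
    y = np.zeros(gmm.mu_y.shape[1])
    for m in range(M):
        d = x - gmm.mu_x[m]
        y += post[m] * (gmm.mu_y[m]
                        + gmm.S_yx[m] @ np.linalg.solve(gmm.S_xx[m], d))
    return y
```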

In this tutorial, we first give an overview of basic approaches to statistical voice conversion, focusing especially on conversion methods that do not require any linguistic input. After reviewing frame-by-frame conversion as the standard method, we take a careful look at a state-of-the-art trajectory-based conversion method, which uses statistics calculated over an utterance to effectively reproduce natural speech parameter trajectories, and analyze which problems this method addresses well. We then look at an extension of the trajectory-based method that achieves a lower conversion delay, making it possible to use state-of-the-art voice conversion in real-time applications.
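
The utterance-level step is essentially maximum-likelihood parameter generation (MLPG): given per-frame means and precisions for the stacked static and delta parameters, the smooth static trajectory solves a set of normal equations built from the window matrix W of the trajectory relation o = Wc. The sketch below shows one feature dimension, with a dense diagonal precision matrix for clarity; low-delay variants solve the same equations over a sliding window instead of the whole utterance.

```python
import numpy as np

def mlpg(W, means, precisions):
    """Maximum-likelihood parameter generation, one dimension.

    W:          (2T, T) window matrix from o = W c.
    means:      (2T,) per-frame [static, delta] means.
    precisions: (2T,) diagonal precisions for those means.
    Returns the (T,) smooth static trajectory solving
        (W^T D W) c = W^T D means,   D = diag(precisions).
    """
    D = np.diag(precisions)          # dense here only for clarity
    A = W.T @ D @ W
    b = W.T @ D @ means
    return np.linalg.solve(A, b)
```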

Real-time voice conversion has the potential to enhance human-to-human speech communication by overcoming barriers such as physical constraints that cause vocal disorders and environmental constraints that prevent producing or conveying intelligible speech. In this tutorial, we look at body-conducted speech enhancement and speaking aids for total laryngectomees as examples of voice conversion applications; the basic conversion algorithm applies effectively to mapping problems between many different types of speech parameters. From these examples, we learn how to apply voice conversion techniques to individual speech applications to successfully augment human communication.

Speaker Biography

Tomoki Toda earned his B.E. degree from Nagoya University, Aichi, Japan, in 1999 and his M.E. and D.E. degrees from the Graduate School of Information Science, NAIST, Nara, Japan, in 2001 and 2003, respectively. He was a Research Fellow of the JSPS at the Graduate School of Engineering, Nagoya Institute of Technology, Aichi, Japan, from 2003 to 2005. He was an Assistant Professor at the Graduate School of Information Science, NAIST, from 2005 to 2011, where he is currently an Associate Professor. His research interests include statistical approaches to speech processing, such as voice transformation, speech synthesis, and speech analysis. He received the 18th TELECOM System Technology Award for Students and the 23rd TELECOM System Technology Award from the TAF; the 2007 ISS Best Paper Award and the 2010 ISS Young Researcher's Award in Speech Field from the IEICE; the 10th Ericsson Young Scientist Award from Nippon Ericsson K.K.; the 4th Itakura Prize Innovative Young Researcher Award and the 26th Awaya Prize Young Researcher Award from the ASJ; and the 2009 Young Author Best Paper Award from the IEEE SPS. He was a member of the Speech and Language Technical Committee of the IEEE SPS from 2007 to 2009.



Large Vocabulary Speech Recognition Using Deep Neural Networks: Insights, Theory, and Practice

Dr. Dong Yu
Researcher, Microsoft Speech Research Group


Abstract

In this tutorial, I will describe the promising context-dependent deep neural network hidden Markov model (CD-DNN-HMM) for large vocabulary speech recognition (LVSR), covering the key insights, theory and practice. More specifically, I will discuss why DNNs can be more powerful than shallow neural networks, how DNNs can be trained effectively and efficiently, what the core ingredients and procedures are when applying DNNs to LVSR, where the gains come from, and how additional error rate reductions may be achieved.
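
For orientation, the sketch below shows the scoring step of a hybrid CD-DNN-HMM system: a DNN produces posteriors over senones (tied context-dependent HMM states), which the decoder converts to scaled likelihoods by dividing by the senone priors. Layer sizes, activation choices and all names are illustrative assumptions, not the tutorial's specific recipe.

```python
import numpy as np

def dnn_scaled_likelihoods(x, layers, log_prior):
    """Hybrid CD-DNN-HMM scoring (schematic).

    layers:    list of (weight, bias) pairs; hidden layers use a
               sigmoid, the output layer a softmax over senones.
    log_prior: (S,) log senone priors estimated from alignments.
    The HMM decoder needs likelihoods p(x|s), so the senone
    posteriors p(s|x) are divided by the priors p(s), dropping the
    constant p(x):  log p(x|s) = log p(s|x) - log p(s) + const.
    """
    h = x
    for w, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(w @ h + b)))  # sigmoid hidden layers
    w, b = layers[-1]
    z = w @ h + b
    z -= z.max()                                 # numerically stable softmax
    log_post = z - np.log(np.exp(z).sum())       # log p(s|x)
    return log_post - log_prior                  # scaled log-likelihoods
```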

Speaker Biography

Dr. Dong Yu joined Microsoft Corporation in 1998 and the Microsoft Speech Research Group in 2002, where he is a Researcher. His current work focuses on deep learning and its application to large vocabulary speech recognition. The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) that he recently co-proposed and developed has been challenging the dominant position of conventional GMM-based systems for large vocabulary speech recognition. Dr. Yu has published around 100 papers on speech processing and machine learning and is the inventor or co-inventor of more than 40 granted or pending patents. He is currently serving as an associate editor of the IEEE Transactions on Audio, Speech, and Language Processing (2011-), and has served as an associate editor of the IEEE Signal Processing Magazine (2008-2011) and as the lead guest editor of the IEEE Transactions on Audio, Speech, and Language Processing special issue on deep learning for speech and language processing (2010-2011).