This is an Open Access publication published under CSC-OpenAccess Policy.
A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System
R. Rajavel, P.S. Sathidevi
Pages - 23-37    |    Revised - 25-02-2010    |    Published - 26-03-2010
Volume - 4, Issue - 1    |    Publication Date - March 2010
KEYWORDS
Audio-visual speech recognition, reliability-ratio based weight optimization, late integration
ABSTRACT
Audio-visual speech recognition (AVSR), which uses both the acoustic and visual signals of speech, has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for human speech recognition. An important issue in decision-fusion-based AVSR systems is how to obtain an appropriate integration weight for the two speech modalities, so that the combined AVSR system performs better than both the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme that derives the integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized reliability-ratio based weight estimation scheme is demonstrated via single-speaker isolated-word recognition experiments on mobile-function words. The results show that the proposed scheme improves recognition accuracy over the conventional unimodal systems and a baseline reliability-ratio based AVSR system under various signal-to-noise ratio conditions.
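The late-integration scheme the abstract describes can be sketched as follows: each classifier scores every candidate word, a per-stream reliability is estimated from how far the best hypothesis stands above the rest, and the combined score is a convex combination of the two streams. This is a minimal illustrative sketch, not the paper's implementation; the function names and the simple reliability-to-weight mapping are assumptions (the paper optimizes this mapping with a GA rather than using the plain ratio shown here).

```python
import numpy as np

def stream_reliability(log_likelihoods):
    """Reliability of one stream: mean margin of the best hypothesis
    over all competing hypotheses (a common dispersion-based measure)."""
    ll = np.sort(log_likelihoods)[::-1]  # descending
    return np.mean(ll[0] - ll[1:])

def fusion_weight(audio_ll, visual_ll):
    """Integration weight from the relative reliability of each modality.
    NOTE: the paper maps the reliability ratio to a weight via a
    GA-optimized scheme; the plain ratio below is a stand-in."""
    r_a = stream_reliability(audio_ll)
    r_v = stream_reliability(visual_ll)
    return r_a / (r_a + r_v)

def fuse(audio_ll, visual_ll):
    """Late integration: weighted sum of the two streams' log-likelihoods."""
    lam = fusion_weight(audio_ll, visual_ll)
    return lam * audio_ll + (1.0 - lam) * visual_ll

# Toy example: log-likelihoods of three candidate words from each classifier.
audio = np.log(np.array([0.7, 0.2, 0.1]))   # audio stream is confident
visual = np.log(np.array([0.4, 0.5, 0.1]))  # visual stream mildly disagrees
combined = fuse(audio, visual)
print(int(np.argmax(combined)))  # index of the recognized word
```

Because the audio stream's best hypothesis stands far above its competitors, its reliability (and hence its weight) dominates in this toy case; as acoustic noise flattens the audio scores, the weight shifts automatically toward the visual stream, which is the behavior the integration weight is meant to capture.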
CITED BY (3)  
1 Stewart, D., Seymour, R., Pass, A., & Ming, J. (2014). Robust audio-visual speech recognition under noisy audio-video conditions. Cybernetics, IEEE Transactions on, 44(2), 175-184.
2 Shaikh, A. A., Kumar, D. K., & Gubbi, J. (2013). Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments. The Visual Computer, 29(10), 969-982.
3 Chaudhary, K. (2012). Joint Error Optimization Algorithms for Multimodal Information Fusion.
Mr. R. Rajavel
India
rettyraja@gmail.com

Professor P.S. Sathidevi
NIT Calicut, India