| |
| |
|
|
|
|
| A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System
|
|
Full
text: |
PDF(354.5KB) |
|
|
Source |
Signal Processing: An International Journal (SPIJ) |
|
Table of Contents |
|
|
Download
Complete Issue PDF(1.86MB) |
|
Volume: 4 Issue: 1 |
| |
Pages: 1-67 |
|
Publication
Date: March 2010 |
|
ISSN
(Online): 1985-2339 |
|
|
|
|
|
Pages |
23 - 37 |
|
Author(s) |
|
|
|
Published
Date |
26-03-2010 |
|
Publisher |
CSC
Journals, Kuala Lumpur,
Malaysia |
|
ADDITIONAL
INFORMATION |
| Keywords Abstract References Cited by Related Articles Collaborative
Colleague |
| |
|
| |
KEYWORDS: Audio-visual speech recognition, Reliability-ratio based weight optimization, late integration |
|
|
| |
|
|
| This Manuscript is indexed in the following databases/websites:- |
|
| 1. Directory of Open Access Journals (DOAJ) |
| 2. Docstoc |
| 3. PDFCAST |
| 4. Scribd |
| 5. WorldCat |
| 6. ScientificCommons |
| 7. Google Scholar |
| 8. refSeek |
| 9. ResearchGATE |
| 10. Bielefeld Academic Search Engine (BASE) |
| 11. Academic Index |
| 12. iSEEK |
| 13. Socol@r |
| |
|
| |
|
|
| Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech have received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important
issue in decision fusion based AVSR system is how to obtain the appropriate integration weight for the speech modalities to integrate and ensure
the combined AVSR system’s performances better than that of the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA optimized reliability-ratio based weight estimation scheme is demonstrated via single speaker, mobile functions isolated
word recognition experiments. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and the baseline reliability ratio-based AVSR system under various signal to noise ratio conditions. |
| |
|
| |
|
| |
| 1 |
K. Iwano, T. Yoshinaga, S. Tamura, S. Furui. “Audio-visual speech recognition using lip information extracted from side-face images”. EURASIP Journal on Audio, Speech, and Music Processing, (2007): 9 pages, Article ID 64506, 2007 |
|
|
| 2 |
J.S. Lee, C. H. Park. “Adaptive Decision Fusion for Audio-Visual Speech Recognition”’. In: F. Mihelic, J. Zibert (Eds.), Speech Recognition, Technologies and Applications, pp. 550 (2008) |
|
|
| 3 |
J.S. Lee, C. H. Park. “Robust audio-visual speech recognition based on late integration”’. IEEE Transaction on Multimedia, 10: 767-779, 2008 |
|
|
| 4 |
G. F. Meyer, J. B.Mulligan, S. M.Wuerger. “Continuous audiovisual digit recognition using N-best decision fusion”. Information Fusion. 5: 91-101, 2004 |
|
|
| 5 |
A. Rogozan, P. Delglise. “Adaptive fusion of acoustic and visual sources for automatic speech recognition”. Speech Communication. 26: 149-161, 1998 |
|
|
| 6 |
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. “Recent advances in the automatic recognition of audio-visual speech”. In Proceedings of IEEE, 91(9), 2003 |
|
|
| 7 |
S. Dupont, J. Luettin. “Audio-visual speech modeling for continuous speech recognition”. IEEE Transanction on Multimedia, 2: 141-151, 2000 |
|
|
| 8 |
G. Potamianos, H. P. Graf, and E. Cosatto. “An image transform approach for HMM based automatic lipreading”. In Proceedings of International Conference on Image Processing. Chicago, 1998 |
|
|
| 9 |
R. Rajavel, P. S. Sathidevi. “Static and dynamic features for improved HMM based visual speech recognition”. In Proceedings of 1st International Conference on Intelligent Human Computer Interaction, Allahabad, India, 2009 |
|
|
| 10 |
G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu. “A cascade image transform for speaker independent automatic speechreading”. In Proceedings of IEEE International Conference on Multimedia and Expo. New York, 2000 |
|
|
| 11 |
W. C. Yau, D. K. Kumar, S. P. Arjunan. “Voiceless speech recognition using dynamic visual speech features”. In Proceedings of HCSNet Workshop on the Use of Vision in HCI. Canberra, Australia, 2006 |
|
|
| 12 |
W. C. Yau, D. K. Kumar, H. Weghorn. “Visual speech recognition using motion features and Hidden Markov models”. In: M. Kampel, A. Hanbury (Eds.), LNCS, Springer, Heidelberg, pp. 832-839 (2007) |
|
|
| 13 |
G. Potamianos, C. Neti, J. Luettin, and I. Matthews. “Audio-visual automatic speech recognition: An overview”. In: G. Baily, E. Vatikiotis-Bateson, P. Perrier (Eds.), Issues in visual and audio-visual speech processing, MIT Press, (2004) |
|
|
| 14 |
R. Seymour, D. Stewart, J. Ming. “Comparison of image transformbased features for visual speech recognition in clean and corrupted videos”. EURASIP Journal on Image and Video Processing. (2008), doi:10.1155/2008/810362, 2008 |
|
|
| 15 |
B. Plannerer. “An introduction to speech recognition: A tutorial ”. Germany, 2003 |
|
|
| 16 |
L. Rabiner, B.H. Juang. “Fundamentals of Speech Recognition”’. Prentice Hall, Englewood Cliffs (1993) |
|
|
| 17 |
B. Nasersharif, A. Akbari. “SNR-dependent compression of enhanced Mel sub-band energies for compensation of noise effects on MFCC features”. Pattern Recognition Letters, 28:1320-1326, 2007 |
|
|
| 18 |
T. Chen. “Audiovisual speech processing. Lip reading and lip synchronization”. IEEE Signal Processing Magazine, 18: 9-21, 2001 |
|
|
| 19 |
E. D. Petajan. “Automatic lipreading to enhance speech recognition”. In Proceedings of Global Telecommunications Conference. Atlanta, 1984 |
|
|
| 20 |
P. Arnold, F. Hill. “Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact”. Brit. J. Psychol., 92: 339-355, 2001 |
|
|
| 21 |
A. Q. Summerfield. “Some preliminaries to a comprehensive account of audio-visual speech perception”. In: B. Dodd, R. Campbell (Eds.), Hearing by Eye: The Psychology of Lip-reading. Lawrence Erlbarum, London, pp. 3-51 (1987) |
|
|
| 22 |
C. Benoit, T. Mohamadi, S. D. Kandel. “Effects of phonetic context on audio-visual intelligibility of French”. Journal of Speech and Hearing Research. 37: 1195-1203, 1994 |
|
|
| 23 |
C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou. “Audio visual speech recognition, Final Workshop 2000 Report”. Center for Language and Speech Processing, Johns Hopkins University, Baltimore, 2000 |
|
|
| 24 |
P. Teissier, J. Robert-Ribes, J. L. Schwartz. “Comparing models for audiovisual fusion in a noisyvowel recognition task”. IEEE Transaction on Speech Audio Processing, 7: 629-642, 1999 |
|
|
| 25 |
C. C. Chibelushi, F. Deravi, J. S. D. Mason. “A review of speech-based bimodal recognition”. IEEE Transactions on Multimedia, 4(1): 23-37, 2002 |
|
|
| 26 |
P.L. Silsbee. “Sensory integration in audiovisual automatic speech recognition”. In Proceedings of the 28th Annual Asilomar Conference on Signals, Systems, and Computers, 1: 561-565, 1994 |
|
|
| 27 |
C. Benot. “The intrinsic bimodality of speech communication and the synthesis of talking faces”. In: M. M. Taylor, F. Nel, D. Bouwhuis (Eds.), The Structure of Multimodal Dialogue II. Amsterdam, Netherlands, pp. 485-502 (2000) |
|
|
| 28 |
G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E.Marcheret, N. Hass, J. Jiang. “Towards practical development of audiovisual speech recognition”. In Proceedings of IEEE International Conf. on Acoustic, Speech, and Signal Processing. Canada, 2004 |
|
|
| 29 |
S.W.Foo, L. Dong. “Recognition of Visual Speech Elements Using Hidden Markov Models”. In: Y. C. Chen, L.W. Chang, C.T. Hsu (Eds.), Advances in Multimedia Information Processing-PCM02, LNCS2532. Springer-Verlag Berlin Heidelberg, pp.607-614 (2002) |
|
|
| 30 |
A. Verma, T. Faruquie, C. Neti, S. Basu. “Late integration in audiovisual continuous speech recognition”. In Proceedings of Workshop on Automatic Speech Recognition and Understanding. Keystone, 1999 |
|
|
| 31 |
S. Tamura, K. Iwano, S. Furui. “A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization”. In Proceedings of ICASSP. Philadelphia, 2005 |
|
|
| 32 |
A. Adjoudani, C. Benot. “On the integration of auditory and visual parameters in an HMM-based ASR”. In: D. G. Stork and M. E. Hennecke (Eds.), Speech reading by Humans and Machines: Models, Systems, and Speech Recognition, Technologies and Applications, Springer, Berlin, Germany, pp. 461-472 (1996) |
|
|
| |
|
| |
|
| |
| |
|
| |
|
| |
| |
|
| |
|
| |
|
| Rajavel : Colleagues
|
|
| P.S. Sathidevi : Colleagues
|
|
|
|
|
|
|
|
|
|
|