The Egyptian Journal of Language Engineering

Enhancement Quality and Accuracy of Speech Recognition System Using Multimodal Audio-Visual Speech signal

Article 3, Volume 4, Issue 2, September 2017, Pages 27-40
Document Type: Original Article
DOI: 10.21608/ejle.2017.59430
Authors
Eslam Eid Elmaghraby 1; Amr Gody 2; Mohamed Hashem Farouk 3
1 Communication and Electronics Engineering Department, Faculty of Engineering, Fayoum University
2 Faculty of Engineering, Fayoum University
3 Engineering Mathematics & Physics Department, Faculty of Engineering, Cairo University
Abstract
Most developments in automatic speech recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can suffer from deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. Combining the auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality, and multimodal recognition is therefore acknowledged as a vital component of the next generation of spoken language systems. This paper builds a connected-words audio-visual speech recognition system (AV-ASR) for English that uses both acoustic and visual speech information to improve recognition performance. Mel-frequency cepstral coefficients (MFCCs) are used to extract audio features from the speech files. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are extracted from the speaker's mouth region, and Principal Component Analysis (PCA) is applied for dimensionality reduction. The visual features are then concatenated with the traditional audio features, and the combined features are used to train hidden Markov model (HMM) parameters with word-level acoustic models. The system is developed with the Hidden Markov Model Toolkit (HTK). The potential of the suggested approach is demonstrated by a preliminary experiment on the GRID sentence database, one of the largest databases available for audio-visual recognition, which contains continuous English voice commands for a small-vocabulary task. The experimental results show that the proposed AV-ASR system achieves a higher recognition rate than an audio-only recognizer and exhibits more robust performance. An increase in success rate of 4% over all speakers is achieved for the grammar-based word recognition system in the speaker-independent test.
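As a minimal sketch of the feature-extraction and fusion pipeline outlined in the abstract, the Python code below extracts MFCCs from the audio, applies a 2-D DCT to mouth-region frames followed by PCA, and concatenates the two feature streams. The library choices (librosa, SciPy, scikit-learn), the 8x8 low-frequency DCT block, and the frame-alignment step are illustrative assumptions; the authors' actual system was built with MATLAB and HTK, and its exact parameters are not given here.

import numpy as np
import librosa
from scipy.fft import dctn
from sklearn.decomposition import PCA

def audio_features(wav_path, n_mfcc=13):
    """Extract MFCC features from a speech file; returns (n_frames, n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

def visual_features(mouth_frames, n_components=20):
    """2-D DCT of each grayscale mouth-region frame, reduced with PCA."""
    dct_vectors = []
    for frame in mouth_frames:              # each frame: 2-D array (H x W)
        coeffs = dctn(frame.astype(float), norm='ortho')
        # keep the low-frequency top-left block as a compact descriptor
        dct_vectors.append(coeffs[:8, :8].ravel())
    dct_vectors = np.array(dct_vectors)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(dct_vectors)   # (n_frames, n_components)

def fuse(audio_feats, visual_feats):
    """Frame-level feature fusion: concatenate audio and visual vectors.
    Assumes the visual stream was interpolated to the audio frame rate."""
    n = min(len(audio_feats), len(visual_feats))
    return np.hstack([audio_feats[:n], visual_feats[:n]])

The fused vectors would then serve as observations for word-level HMM training and decoding, which the paper carries out with the HTK toolkit.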
Keywords
AV-ASR; HMM; HTK; MFCC; DCT; PCA; MATLAB; GRID