The Egyptian Journal of Language Engineering
ElMaghraby, E., Gody, A., Farouk, M. (2020). Noise-Robust Speech Recognition System based on Multimodal Audio-Visual Approach Using Different Deep Learning Classification Techniques. The Egyptian Journal of Language Engineering, 7(1), 27-42. doi: 10.21608/ejle.2020.22022.1002

Noise-Robust Speech Recognition System based on Multimodal Audio-Visual Approach Using Different Deep Learning Classification Techniques

Article 3, Volume 7, Issue 1, April 2020, Pages 27-42
Document Type: Original Article
DOI: 10.21608/ejle.2020.22022.1002
Authors
Eslam E ElMaghraby 1; Amr M Gody 2; Mohamed Hashem Farouk 3
1 El Mashtal St., El Fayoum, Egypt
2 Faculty of Engineering, Fayoum University
3 Engineering Mathematics and Physics Dept., Faculty of Engineering, Cairo University
Abstract
This paper extends earlier work on a speech recognition system that adds a visual modality to the audio modality, based on the Hidden Markov Model (HMM) classification technique [1]. Accuracy beyond that of traditional HMM-based Automatic Speech Recognition (ASR) is achieved by replacing the classifier with either an RNN-based or a CNN-based approach. This research intends to deliver two contributions. The first is a methodology for choosing the visual features: different visual feature extraction methods, such as the Discrete Cosine Transform (DCT), blocked DCT, and Histograms of Oriented Gradients with Local Binary Patterns (HOG+LBP), are compared, and different dimension-reduction techniques, such as Principal Component Analysis (PCA), auto-encoders, Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE), are applied to find the most effective feature-vector size. The obtained visual features are then early-integrated with the audio features obtained using Mel Frequency Cepstral Coefficients (MFCCs), and the combined audio-visual feature vector is fed to the classification process. The second contribution is a methodology for developing the classification process using deep learning: different Deep Neural Network (DNN) architectures, such as Bidirectional Long Short-Term Memory (BiLSTM) and the Convolutional Neural Network (CNN), are compared with the traditional HMM. The proposed model is evaluated on two multi-speaker AV-ASR datasets, AVletters and GRID, at different SNRs; experiments are speaker-independent on the AVletters dataset and speaker-dependent on the GRID dataset.
Keywords
AV-ASR; DCT; MFCC; HMM; DNN
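The early-integration step described in the abstract — extracting DCT-based visual features from mouth-region frames, reducing their dimensionality with PCA, and concatenating them frame-wise with audio features — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the dimensions used here (13 MFCCs per frame, 16x16 mouth ROI, a 4x4 low-frequency DCT corner, 8 PCA components) and the random placeholder data are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * np.arange(n) + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def dct2(block):
    """2-D DCT of a square image block via separable 1-D transforms."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

def pca_reduce(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy stand-ins: 100 synchronized frames of audio and mouth-region video.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(100, 13))      # 13 MFCC-like values per frame (placeholder)
mouth_rois = rng.normal(size=(100, 16, 16))   # 16x16 mouth region per frame (placeholder)

# Visual features: keep the 4x4 low-frequency corner of each frame's 2-D DCT.
visual_feats = np.stack([dct2(f)[:4, :4].ravel() for f in mouth_rois])

# Dimension reduction, then early integration by frame-wise concatenation.
visual_red = pca_reduce(visual_feats, 8)
fused = np.concatenate([audio_feats, visual_red], axis=1)
print(fused.shape)  # (100, 21): 13 audio + 8 reduced visual dims per frame
```

The fused per-frame vectors are what a classifier (HMM, BiLSTM, or CNN, as compared in the paper) would consume; late integration would instead combine per-modality classifier outputs.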