A Hybrid Wavelet Scattering and Mel Spectrogram Feature with Deep Convolutional Neural Network for Robust Spoken Digit Recognition


Irmawan Irmawan
Suci Dwijayanti
Bhakti Yudho Suprapto

Keywords

Spoken digit recognition, Deep CNN, Wavelet Time Scattering, MFCC, Biometric

Abstract

Spoken digit recognition (SDR) plays a critical role in biometric authentication and human–computer interaction, yet existing approaches often rely on small datasets, limited feature representations, or architectures prone to overfitting. To address these limitations, this study proposes a robust end-to-end pipeline that integrates Wavelet Time Scattering (WTS), Mel-Frequency Cepstral Coefficients (MFCC), and a 2D Deep Convolutional Neural Network (2D-CNN) to enhance the accuracy and generalization of SDR systems in realistic environments. The Free Spoken Digit Dataset (FSDD), consisting of 3000 audio samples from speakers with diverse accents, was pre-processed using zero-padding normalization and transformed into high-resolution time–frequency spectrograms via WTS. The proposed CNN architecture, optimized through systematic experimentation on batch size and learning rate, demonstrated stable convergence and superior discriminative capability. Using a learning rate of 0.001 and a batch size of 50, the model achieved the highest performance with 99.2% accuracy, outperforming established methods including SVM, MFCC-LSTM, and Multiple RNN architectures. Comparative evaluations further revealed that the combined WTS–MFCC feature extraction significantly enhances spectral–temporal representation quality, contributing to improved classification precision across all digit classes. These findings demonstrate that the proposed WTS–MFCC–CNN framework not only advances SDR accuracy but also provides a scalable and computationally efficient approach suitable for real-world biometric, financial, and voice-controlled applications. The results highlight the potential of hybrid time–frequency representations integrated with deep architectures to set a new benchmark for robust spoken digit recognition.
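The zero-padding normalization step mentioned above can be sketched as follows. This is an illustrative sketch, not the authors' code: the 8000-sample target length assumes FSDD's 8 kHz sampling rate and a nominal 1 s clip duration, and the function name `zero_pad` is hypothetical.

```python
import numpy as np

TARGET_LEN = 8000  # assumed: 1 s at FSDD's 8 kHz sampling rate


def zero_pad(signal, target_len=TARGET_LEN):
    """Normalize clip length: truncate longer clips, zero-pad shorter ones,
    so every sample has an identical shape before feature extraction."""
    sig = np.asarray(signal, dtype=np.float32)[:target_len]
    out = np.zeros(target_len, dtype=np.float32)
    out[:sig.size] = sig
    return out


# Usage: a 0.5 s clip (4000 samples) is padded to a fixed 1 s length
clip = np.random.randn(4000)
padded = zero_pad(clip)
assert padded.shape == (TARGET_LEN,)
```

Fixing the length this way lets the subsequent WTS/MFCC transforms produce equally sized time–frequency images, which a 2D-CNN requires as input.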

References

S. Nasr, M. Quwaider, and R. Qureshi, "Text-independent Speaker Recognition using Deep Neural Networks," in 2021 International Conference on Information Technology (ICIT), 2021, pp. 517-52, doi: 10.1109/ICIT52682.2021.9491705.

A. Boles and P. Rad, "Voice biometrics: Deep learning-based voiceprint authentication system," in 2017 12th System of Systems Engineering Conference (SoSE), 2017, pp. 1-6, doi: 10.1109/SYSOSE.2017.7994971.

A. Irum and A. Salman, "Speaker verification using deep neural networks: A," vol. 9, no. 1, 2019, doi: 10.18178/ijmlc.2019.9.1.760

R. Qureshi, M. Nawaz, F. Y. Khuhawar, N. Tunio, and M. Uzair, "Analysis of ECG signal processing and filtering algorithms," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 3, 2019, doi: 10.14569/ijacsa.2019.0100370.

R. Qureshi, S. A. R. Rizvi, S. H. A. Musavi, S. Khan, and K. Khurshid, "Performance analysis of adaptive algorithms for removal of low frequency noise from ECG signal," in 2017 International Conference on Innovations in Electrical Engineering and Computational Technologies (ICIEECT), 2017, pp. 1-5, doi: 10.1109/ICIEECT.2017.7916551.

D. Stoyanov et al., Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings. Springer, 2018.

R. Qureshi, M. Uzair, K. Khurshid, and H. Yan, "Hyperspectral document image processing: Applications, challenges and future prospects," Pattern Recognit., vol. 90, pp. 12-22, 2019, doi: 10.1016/j.patcog.2019.01.026.

M. Sajjad and S. Kwon, "Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM," IEEE Access, vol. 8, pp. 79861-79875, 2020, doi: 10.1109/ACCESS.2020.2990405.

M. Zhang, M. Diao, and L. Guo, "Convolutional neural networks for automatic cognitive radio waveform recognition," IEEE Access, vol. 5, pp. 11074-11082, 2017, doi: 10.1109/ACCESS.2017.2716191.

O. Krestinskaya, I. Dolzhikova, and A. P. James, "Hierarchical temporal memory using memristor networks: A survey," IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 5, pp. 380-395, 2018, doi: 10.1109/TETCI.2018.2838124.

S. Becker, M. Ackermann, S. Lapuschkin, K.-R. Müller, and W. Samek, "Interpreting and explaining deep neural networks for classification of audio signals," arXiv preprint, 2018.

Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv preprint, 2017, doi: 10.48550/arXiv.1701.02720.

T. B. Mokgonyane, T. J. Sefara, T. I. Modipa, M. M. Mogale, M. J. Manamela, and P. J. Manamela, "Automatic speaker recognition system based on machine learning algorithms," in 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA), 2019, pp. 141-146, doi: 10.1109/RoboMech.2019.8704837.

F. A. R. Rahman Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5359-5363, doi: 10.1109/ICASSP.2018.8461587.

S. Dey, P. Motlicek, S. Madikeri, and M. Ferras, "Template-matching for text-dependent speaker verification," Speech Commun., vol. 88, pp. 96-105, 2017, doi: 10.1016/j.specom.2017.01.009.

W. Feng, N. Guan, Y. Li, X. Zhang, and Z. Luo, "Audio visual speech recognition with multimodal recurrent neural networks," in 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 681-688, doi: 10.1109/IJCNN.2017.7965918.

Y. Yu, X. Si, C. Hu, and J. Zhang, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Comput., vol. 31, no. 7, pp. 1235-1270, 2019, doi: 10.1162/neco_a_01199.

W. Zhang, M. Zhai, Z. Huang, C. Liu, W. Li, and Y. Cao, "Towards end-to-end speech recognition with deep multipath convolutional neural networks," in International Conference on Intelligent Robotics and Applications, 2019, pp. 332-341.

S. Basu, J. Chakraborty, and M. Aftabuddin, "Emotion recognition from speech using convolutional neural network with recurrent neural network architecture," in 2017 2nd International Conference on Communication and Electronics Systems (ICCES), 2017, pp. 333-336, doi: 10.1109/CESYS.2017.8321292.

Z. Jackson, "Free Spoken Digit Dataset (FSDD)," 2016.

S. Otte, P. Rubisch, and M. V. Butz, "Gradient-based learning of compositional dynamics with modular RNNs," in International Conference on Artificial Neural Networks, 2019, pp. 484-496.

F. M. Bayer, A. J. Kozakevicius, and R. J. Cintra, "An iterative wavelet threshold for signal denoising," Signal Process., vol. 162, pp. 10-20, 2019, doi: 10.1016/j.sigpro.2019.04.005.

W. Liu and W. Chen, "Recent advancements in empirical wavelet transform and its applications," IEEE Access, vol. 7, pp. 103770-103780, 2019, doi: 10.1109/ACCESS.2019.2930529.

R. V. Sharan and T. J. Moir, "Time-frequency image resizing using interpolation for acoustic event recognition with convolutional neural networks," in 2019 IEEE International Conference on Signals and Systems (ICSigSys), 2019, pp. 8-11, doi: 10.1109/ICSIGSYS.2019.8811088.

K.-L. Chung, T.-C. Leung, T.-Y. Liu, and Y.-C. Tseng, "A Cubic Convolution Interpolation-Based Chroma Subsampling Method for Bayer and RGBW CFA Raw Images," IEEE Access, vol. 10, pp. 22687-22699, 2022, doi: 10.1109/ACCESS.2022.3154487.

A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143-19165, 2019, doi: 10.1109/ACCESS.2019.2896880.

Q. Li et al., "MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications," IEEE Access, vol. 8, pp. 48720-48730, 2020, doi: 10.1109/ACCESS.2020.2979799.

N. Naka and V. Ruoppila, "Linear prediction coefficient conversion device and linear prediction coefficient conversion method," Google Patents, 2018.

E. Tatulli and T. Hueber, "Feature extraction using multimodal convolutional neural networks for visual speech recognition," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2971-2975, doi: 10.1109/ICASSP.2017.7952701.

S. Roy, N. Das, M. Kundu, and M. Nasipuri, "Handwritten isolated Bangla compound character recognition: A new benchmark using a novel deep learning approach," Pattern Recognit. Lett., vol. 90, pp. 15-21, 2017, doi: 10.1016/j.patrec.2017.03.004.

T. J. Jun, H. M. Nguyen, D. Kang, D. Kim, D. Kim, and Y.-H. Kim, "ECG arrhythmia classification using a 2-D convolutional neural network," arXiv preprint, 2018, doi: 10.48550/arXiv.1804.06812.

O. F. Reyes-Galaviz, W. Pedrycz, Z. He, and N. J. Pizzi, "A supervised gradient-based learning algorithm for optimized entity resolution," Data Knowl. Eng., vol. 112, pp. 106-129, 2017, doi: 10.1016/j.datak.2017.10.004.

Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, "Image inpainting via generative multi-column convolutional neural networks," Adv. Neural Inf. Process. Syst., vol. 31, 2018.

W. Yin et al., "Self-adjustable domain adaptation in personalized ECG monitoring integrated with IR-UWB radar," Biomed. Signal Process. Control, vol. 47, pp. 75-87, 2019, doi: 10.1016/j.bspc.2018.08.002.

Y. F. Utomo, E. C. Djamal, F. Nugraha, and F. Renaldi, "Spoken word and speaker recognition using MFCC and multiple recurrent neural networks," in 2020 7th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), 2020, pp. 192-197, doi: 10.23919/EECSI50503.2020.9251870.

M. Jain, S. Narayan, P. Balaji, A. Bhowmick, and R. K. Muthu, "Speech emotion recognition using support vector machine," arXiv preprint, 2020, doi: 10.48550/arXiv.2002.07590.

T. Zia and U. Zahid, "Long short-term memory recurrent neural network architectures for Urdu acoustic modeling," Int. J. Speech Technol., vol. 22, no. 1, pp. 21-30, 2019.

R. V. Sharan, "Spoken digit recognition using wavelet scalogram and convolutional neural networks," in 2020 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2020, pp. 101-105, doi: 10.1109/RAICS51191.2020.9332505.

H. Ba, "Spoken Digit Classification: A Method Using Convolutional Neural Network and Mixed Feature," doi: 10.18178/wcse.2021.02.002.

A. S. M. B. Wazir and J. H. Chuah, "Spoken Arabic digits recognition using deep learning," in 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), 2019, pp. 339-344, doi: 10.1109/I2CACIS.2019.8825004.
